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Floating-point arithmetic is considered an esotoric subject by many people. This is 
rather surprising, because floating-point is ubiquitous in computer systems: Almost 
every language has a floating-point datatype; computers from PCs to supercomputers 
have floating-point accelerators; most compilers will be called upon to compile 
floating-point algorithms from time to time; and virtually every operating system must 
respond to floating-point exceptions such as overflow This paper presents a tutorial on 
the aspects of floating-point that have a direct impact on designers of computer 
systems. It begins with background on floating-point representation and rounding 
error, continues with a discussion of the IEEE floating-point standard, and concludes 
with examples of how computer system builders can better support floating point. 

Categories and Subject Descriptors: (Primary) C.O [Computer Systems Organization]: 
General —instruction set design ; D.3.4 [Programming Languages]: 

Processors— compilers, optimization ; G.1.0 [Numerical Analysis]: General —computer 
arithmetic, error analysis, numerical algorithms (Secondary) D.2.1 [Software 
Engineering]: Requirements/Specifications— languages; D.3.1 [Programming 
Languages]: Formal Definitions and Theory— semantics D.4.1 [Operating Systems]: 
Process Management— synchronization 

General Terms: Algorithms, Design, Languages 

Additional Key Words and Phrases: denormalized number, exception, floating-point, 
floating-point standard, gradual underflow, guard digit, NaN, overflow, relative error, 
rounding error, rounding mode, ulp, underflow 


INTRODUCTION 

Builders of computer systems often need 
information about floating-point arith¬ 
metic. There are however, remarkably 
few sources of detailed information about 
it. One of the few books on the subject, 
Floating-Point Computation by Pat Ster- 
benz, is long out of print. This paper is a 
tutorial on those aspects of floating-point 
arithmetic ( floating-point hereafter) that 
have a direct connection to systems 
building. It consists of three loosely con¬ 
nected parts. The first (Section 1) dis¬ 
cusses the implications of using different 
rounding strategies for the basic opera¬ 


tions of addition, subtraction, multipli¬ 
cation, and division. It also contains 
background information on the two 
methods of measuring rounding error, 
ulps and relative error. The second part 
discusses the IEEE floating-point stand¬ 
ard, which is becoming rapidly accepted 
by commercial hardware manufacturers. 
Included in the IEEE standard is the 
rounding method for basic operations; 
therefore, the discussion of the standard 
draws on the material in Section 1. The 
third part discusses the connections be¬ 
tween floating point and the design of 
various aspects of computer systems. 
Topics include instruction set design, 
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optimizing compilers, and exception 
handling. 

All the statements made about float¬ 
ing-point are provided with justifications, 
but those explanations not central to the 
main argument are in a section called 
The Details and can be skipped if de¬ 
sired. In particular, the proofs of many of 
the theorems appear in this section. The 
end of each proof is marked with the ■ 
symbol; when a proof is not included, the 
■ appears immediately following the 
statement of the theorem. 

1. ROUNDING ERROR 

Squeezing infinitely many real numbers 
into a finite number of bits requires an 
approximate representation. Although 
there are infinitely many integers, in 
most programs the result of integer com¬ 
putations can be stored in 32 bits. In 
contrast, given any fixed number of bits, 
most calculations with real numbers will 
produce quantities that cannot be exactly 
represented using that many bits. There¬ 
fore, the result of a floating-point calcu¬ 
lation must often be rounded in order to 


fit back into its finite representation. The 
resulting rounding error is the character¬ 
istic feature of floating-point computa¬ 
tion. Section 1.2 describes how it is 
measured. 

Since most floating-point calculations 
have rounding error anyway, does it 
matter if the basic arithmetic operations 
introduce a bit more rounding error than 
necessary? That question is a main theme 
throughout Section 1. Section 1.3 dis¬ 
cusses guard digits, a means of reducing 
the error when subtracting two nearby 
numbers. Guard digits were considered 
sufficiently important by IBM that in 
1968 it added a guard digit to the double 
precision format in the System/360 ar¬ 
chitecture (single precision already had a 
guard digit) and retrofitted all existing 
machines in the field. Two examples are 
given to illustrate the utility of guard 
digits. 

The IEEE standard goes further than 
just requiring the use of a guard digit. It 
gives an algorithm for addition, subtrac¬ 
tion, multiplication, division, and square 
root and requires that implementations 
produce the same result as that algo¬ 
rithm. Thus, when a program is moved 
from one machine to another, the results 
of the basic operations will be the same 
in every bit if both machines support the 
IEEE standard. This greatly simplifies 
the porting of programs. Other uses of 
this precise specification are given in 
Section 1.5. 

2.1 Floating-Point Formats 

Several different representations of real 
numbers have been proposed, but by far 
the most widely used is the floating-point 
representation. 1 Floating-point represen¬ 
tations have a base (S (which is always 
assumed to be even) and a precision p. If 
IS = 10 and p = 3, the number 0.1 is rep¬ 
resented as 1.00 X 10 -1 . If (3 = 2 and 
p = 24, the decimal number 0.1 cannot 


Examples of other representations are floating 
slash and signed logarithm [Matula and Kornerup 
1985; Swartzlander and Alexopoulos 1975], 
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1 00 x 2 2 1 01 x 2 2 1 10 x 2 2 1.11 x 2 2 

I-[lll| I I I |-1-1-I-1-1-t-1 

0 1 2 3 4 5 6 7 

Figure 1. Normalized numbers when (3 = 2, p = 3, e min = — 1, e max = 2. 


be represented exactly but is approxi¬ 
mately 1.10011001100110011001101 x 
2~ 4 . In general, a floating-point num¬ 
ber will be represented as ±d.dd • • ■ d 
X| 8 e , where d.dd ■ ■ ■ d is called the 
significand 2 and has p digits. More pre¬ 
cisely, ±d 0 .d 1 d 2 ••• d p _ 1 x /3 e repre¬ 
sents the number 

±{d 0 + d l r 1 + +d p _ 1 r (p “ 1 ) )r, 

0 <d,<0. (1) 

The term floating-point number will 
be used to mean a real number that can 
be exactly represented in the format un¬ 
der discussion. Two other parameters 
associated with floating-point represen¬ 
tations are the largest and smallest al¬ 
lowable exponents, e max and e mm . Since 
there are 0 P possible significands and 
e max “ e min + 1 possible exponents, a 
floating-point number can be encoded in 
Jl°g 2 ( e max - e min + 1 )] +[log 2 (d P )] + 1 
bits, where the final +1 is for the sign 
bit. The precise encoding is not impor¬ 
tant for now. 

There are two reasons why a real num¬ 
ber might not be exactly representable as 
a floating-point number. The most com¬ 
mon situation is illustrated by the deci¬ 
mal number 0.1. Although it has a finite 
decimal representation, in binary it has 
an infinite repeating representation. 
Thus, when /3 = 2, the number 0.1 lies 
strictly between two floating-point num¬ 
bers and is exactly representable by nei¬ 
ther of them. A less common situation is 
that a real number is out of range; that 
is, its absolute value is larger than li x 


2 This term was introduced by Forsythe and Moler 
[1967] and has generally replaced the older term 
mantissa. 


0 <w or smaller than 1.0 x d eram . Most of 
this paper discusses issues due to the 
first reason. Numbers that are out of 
range will, however, be discussed in Sec¬ 
tions 2.2.2 and 2.2.4. 

Floating-point representations are not 
necessarily unique. For example, both 
0.01 x 10 1 and 1.00 X 10 -1 represent 
0.1 . If the leading digit is nonzero [ d 0 =£ 0 
in eq. ( 1 )], the representation is said to 
be normalized. The floating-point num¬ 
ber 1.00 X 10 _1 is normalized, whereas 
0.01 X 10 1 is not. When f = 2, p = 3, 
e min = -!. and e max = 2 , there are 16 
normalized floating-point numbers, as 
shown in Figure 1. The bold hash marks 
correspond to numbers whose significand 
is 1.00. Requiring that a floating-point 
representation be normalized makes the 
representation unique. Unfortunately, 
this restriction makes it impossible to 
represent zero! A natural way to repre¬ 
sent 0 is with 1.0 x /3 emm_1 , since this 
preserves the fact that the numerical or¬ 
dering of nonnegative real numbers cor¬ 
responds to the lexicographical ordering 
of their floating-point representations . 3 
When the exponent is stored in a k bit 
field, that means that only 2 k — 1 values 
are available for use as exponents, since 
one must be reserved to represent 0 . 

Note that the X in a floating-point 
number is part of the notation and differ¬ 
ent from a floating-point multiply opera¬ 
tion. The meaning of the x symbol 
should be clear from the context. For 
example, the expression (2.5 X 10" 3 ) X 
(4.0 X 10 2 ) involves only a single float¬ 
ing-point multiplication. 


3 This assumes the usual arrangement where the 
exponent is stored to the left of the significand 
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1.2 Relative Error and Ulps 

Since rounding error is inherent in float¬ 
ing-point computation, it is important to 
have a way to measure this error. Con¬ 
sider the floating-point format with (3 = 
10 and p = 3, which will be used 
throughout this section. If the result of a 
floating-point computation is 3.12 x 10 -2 
and the answer when computed to infi¬ 
nite precision is .0314, it is clear that 
this is in error by 2 units in the last 
place. Similarly, if the real number 
.0314159 is represented as 3.14 x 10~ 2 , 
then it is in error by .159 units in the 
last place. In general, if the floating-point 
number d.d ■ ■ ■ d x (3 e is used to repre¬ 
sent 2 , it is in error by | d.d d — 
(z//3 e ) j (3 P ~ 1 units in the last place. 4 The 
term ulps will be used as shorthand for 
“units in the last place.” If the result of 
a calculation is the floating-point num¬ 
ber nearest to the correct result, it still 
might be in error by as much as 1/2 ulp. 

Another way to measure the difference 
between a floating-point number and the 
real number it is approximating is rela¬ 
tive error, which is the difference be¬ 
tween the two numbers divided by the 
real number. For example, the relative 
error committed when approximating 
3.14159 by 3.14 x 10° is .00159/3.14159 
• .0005. 

To compute the relative error that cor¬ 
responds to 1/2 ulp, observe that when a 
real number is approximated by the 
closest possible floating-point number 

p 

d dd ■■■ dd X I3 e , the absolute error can be 

p 

as large as o oo ■ ■ ■ oo/3' x /3 e where 13' is 
the digit /3/2. This error is ((/3/2)/3 ■ p ) x 

£t e . Since numbers of the form cl. chi - 

dd x (3 e all have this same absolute error 
but have values that range between (3 e 
and /3 x /3 e , the relative error ranges be¬ 
tween (((3/2)(3~ p ) x /3 e /f3 e and ((/3/2)/3~ p ) 


4 Unless the number z is larger than /J e max+i or 
smaller than 0“™™. Numbers that are out of range 
in this fashion will not be considered until further 
notice. 


x 07r +1 . That is, 

11/3 

-r p < -ulp < ~r p . ( 2 ) 

In particular, the relative error corre¬ 
sponding to 1/2 ulp can vary by a factor 
of (3. This factor is called the wobble. 
Setting e = (f3/2)/3 p to the largest of 
the bounds in (2), we can say that when a 
real number is rounded to the closest 
floating-point number, the relative error 
is always bounded by e, which is referred 
to as machine epsilon. 

In the example above, the relative er¬ 
ror was .00159/3.14159 * 0005. To avoid 
such small numbers, the relative error is 
normally written as a factor times e, 
which in this case is e = (/3/2)/3~ p = 
5(10) 3 = .005. Thus, the relative error 
would be expressed as ((.00159/ 
3.14159)/.005)6 * O.le. 

To illustrate the difference between 
ulps and relative error, consider the real 
number x = 12.35. It is approximated by 
x = 1.24 x 10’. The error is 0.5 ulps; the 
relative error is 0.8e. Next consider the 
computation 8x. The exact value is 8x = 
98.8, whereas, the computed value is 8x 
= 9.92 x 10 1 . The error is now 4.0 ulps, 
but the relative error is still 0.8e. The 
error measured in ulps is eight times 
larger, even though the relative error is 
the same. In general, when the base is (3, 
a fixed relative error expressed in ulps 
can wobble by a factor of up to /3. Con¬ 
versely, as eq. (2) shows, a fixed error of 
1/2 ulps results in a relative error that 
can wobble by /3. 

The most natural way to measure 
rounding error is in ulps. For example, 
rounding to the nearest floating-point 
number corresponds to 1/2 ulp. When 
analyzing the rounding error caused by 
various formulas, however, relative error 
is a better measure. A good illustration 
of this is the analysis immediately fol¬ 
lowing the proof of Theorem 10. Since e 
can overestimate the effect of rounding 
to the nearest floating-point number by 
the wobble factor of (3, error estimates of 
formulas will be tighter on machines with 
a small /3. 
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When only the order of magnitude of 
rounding error is of interest, ulps and e 
may be used interchangeably since they 
differ by at most a factor of /3. For exam¬ 
ple, when a floating-point number is in 
error by n ulps, that means the number 
of contaminated digits is log, 3 «. If the 
relative error in a computation is ne, 
then 

contaminated digits ~ log^tt. (3) 

1.3 Guard Digits 

One method of computing the difference 
between two floating-point numbers is to 
compute the difference exactly, then 
round it to the nearest floating-point 
number. This is very expensive if the 
operands differ greatly in size. Assuming 
p = 3, 2.15 x 10 12 - 1.25 x 1(T 5 would 
be calculated as 

* = 2.15 X 10 12 

y = .0000000000000000125 x 10 12 

x - y = 2.1499999999999999875 x 10 12 , 

which rounds to 2.15 X 10 12 . Rather than 
using all these digits, floating-point 
hardware normally operates on a fixed 
number of digits. Suppose the number of 
digits kept is p and that when the 
smaller operand is shifted right, digits 
are simply discarded (as opposed to 
rounding). Then, 2.15 X 10 12 - 1.25 X 
10“ 5 becomes 

x = 2.15 x 10 12 
y = 0.00 X 10 12 
x — y — 2.15 x 10 12 . 

The answer is exactly the same as if the 
difference had been computed exactly 
then rounded. Take another example: 
10.1 — 9.93. This becomes 

x = 1.01 x 10 1 
y = 0.99 X 10 1 
x - y = .02 x 10 1 . 

The correct answer is .17, so the com¬ 
puted difference is off by 30 ulps and is 


wrong in every digit! How bad can the 
error be? 

Theorem 1 

Using a floating-point format with pa¬ 
rameters 13 and p and computing differ¬ 
ences using p digits, the relative error of 
the result can he as large as (3 — 1. 

Proof. A relative error of /3 — 1 in 
the expression x — y occurs when x = 
1.00 • • • 0 and y = .qq • • • p, where q = 
(3 - 1. Here y has p digits (all equal to 
q). The exact difference is x — y = (3~ p . 
When computing the answer using only 
p digits, however, the rightmost digit of 
y gets shifted off, so the computed differ¬ 
ence is /3“ p+1 . Thus, the error is /3 p — 
j3~ p+1 = |3~ p (/3 — 1), and the relative er¬ 
ror is (3- p (P - l)//3 -p = j6 - 1. ■ 

When /3 = 2, the absolute error can be 
as large as the result, and when /3 = 10, 
it can be nine times larger. To put it 
another way, when /3 = 2, (3) shows that 
the number of contaminated digits is 
log 2 (l/e) = log 2 (2 p ) = p. That is, all of 
the p digits in the result are wrong! 

Suppose one extra digit is added to 
guard against this situation (a guard 
digit). That is, the smaller number is 
truncated to p + 1 digits, then the result 
of the subtraction is rounded to p digits. 
With a guard digit, the previous example 
becomes 

x = 1.010 x 10 1 
y = 0.993 x 10 1 
x — y = .017 x 10 1 , 

and the answer is exact. With a single 
guard digit, the relative error of the re¬ 
sult may be greater than e, as in 110 — 
8.59: 

x = 1.10 X 10 2 
y = .085 x 10 2 
x - ^ = 1.015 x 10 2 

This rounds to 102, compared with the 
correct answer of 101.41, for a relative 
error of .006, which is greater than 
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e = .005. In general, the relative error of 
the result can be only slightly larger than 
e. More precisely, we have Theorem 2. 

Theorem 2 

If x and y are floating-point numbers in a 
format with (3 and p and if subtraction is 
done with p + 1 digits ( i.e ., one guard 
digit), then the relative rounding error in 
the result is less than 2e. 

This theorem will be proven in Section 
4.1. Addition is included in the above 
theorem since x and y can be positive 
or negative. 

1.4 Cancellation 

Section 1.3 can be summarized by saying 
that without a guard digit, the relative 
error committed when subtracting two 
nearby quantities can be very large. In 
other words, the evaluation of any ex¬ 
pression containing a subtraction (or an 
addition of quantities with opposite signs) 
could result in a relative error so large 
that all the digits are meaningless (The¬ 
orem 1). When subtracting nearby quan¬ 
tities, the most significant digits in the 
operands match and cancel each other. 
There are two kinds of cancellation: 
catastrophic and benign. 

Catastrophic cancellation occurs when 
the operands are subject to rounding er¬ 
rors. For example, in the quadratic for¬ 
mula, the expression b 2 - 4ac occurs. 
The quantities b 2 and 4ac are subject to 
rounding errors since they are the re¬ 
sults of floating-point multiplications. 
Suppose they are rounded to the nearest 
floating-point number and so are accu¬ 
rate to within 1/2 ulp. When they are 
subtracted, cancellation can cause many 
of the accurate digits to disappear, leav¬ 
ing behind mainly digits contaminated 
by rounding error. Hence the difference 
might have an error of many ulps. For 
example, consider b = 3.34, a = 1.22, 
and c = 2.28. The exact value of b 2 — 
4ac is .0292. But 6 2 rounds to 11.2 and 
4ac rounds to 11.1, hence the final an¬ 
swer is .1, which is an error by 70 ulps 
even though 11.2 - 11.1 is exactly equal 


to .1. The subtraction did not introduce 
any error but rather exposed the error 
introduced in the earlier multiplications. 

Benign cancellation occurs when sub¬ 
tracting exactly known quantities. If x 
and y have no rounding error, then by 
Theorem 2 if the subtraction is done with 
a guard digit, the difference x — y has a 
very small relative error (less than 2 e). 

A formula that exhibits catastrophic 
cancellation can sometimes be rear¬ 
ranged to eliminate the problem. Again 
consider the quadratic formula 


r i 


r 2 


— b+ Vb 2 — 4 ac 

2 a 

— b — Vb 2 — 4ac 

2 a 


(4) 


When b 2 > ac, then b 2 — 4 ac does n ot 
involve a cancellation and \/b 2 - 4ac ~ 

| b |. But the other addition (subtraction) 
in one of the formulas will have a catas¬ 
trophic cancellation. To avoid this, mul¬ 
tiply the nume rator an d denominator of 
rj by -b - \'b 2 -4 ac (and similarly 
for r 2 ) to obtain 

2c 

— b - Vb 2 — 4 ac 


2 -b+v'b^A^c' V ' 

If b 2 > ac and b > 0, then computing r 1 
using formula (4) will involve a cancella¬ 
tion. Therefore, use (5) for computing r x 
and (4) for r 2 . On the other hand, if 
b < 0, use (4) for computing r x and (5) 
for r 2 . 

The expression % 2 — y 2 is another for¬ 
mula that exhibits catastrophic cancella¬ 
tion. It is more accurate to evaluate it as 
(x — y)(x + y). 5 Unlike the quadratic 


Although the expression (x - y)( x + v) does not 
cause a catastrophic cancellation, it is slightly less 
accurate than x 2 — y 2 if x t> y or x < y. In this 
case, (x — y)( x + y) has three rounding errors, but 
x 2 - y 2 has only two since the rounding error com¬ 
mitted when computing the smaller of x 2 and y 2 
does not affect the final subtraction. 
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formula, this improved form still has a 
subtraction, but it is a benign cancella¬ 
tion of quantities without rounding er¬ 
ror, not a catastrophic one. By Theorem 
2, the relative error in x — y is at most 
2e. The same is true of x + y. Multiply¬ 
ing two quantities with a small relative 
error results in a product with a small 
relative error (see Section 4.1). 

To avoid confusion between exact and 
computed values, the following notation 
is used. Whereas x — y denotes the exact 
difference of x and y, x 0 y denotes the 
computed difference (i.e., with rounding 
error). Similarly ffi, 0, and 0 denote 
computed addition, multiplication, and 
division, respectively. All caps indicate 
the computed value of a function, as in 
LN( x) or SQRT( x). Lowercase functions 
and traditional mathematical notation 
denote their exact values as in ln( x) 
and \fx. 

Although ( x 0y) 0 {x © y) is an ex¬ 
cellent approximation of x 2 — y 2 , the 
floating-point numbers x and y might 
themselves be approximations to some 
true quantities x and y. For example, x 
and y might be exactly known decimal 
numbers that cannot be expressed ex¬ 
actly in binary. In this case, even though 
x © y is a good approximation to x — y, 
it can have a huge relative error com¬ 
pared to the true expression x — y, and 
so the advantage of ( x + y)( x — y) over 
x 2 — y 2 is not as dramatic. Since comput¬ 
ing ( x + y)( x - y) is about the same 
amount of work as computing x 2 — y 2 , it 
is clearly the preferred form in this case. 
In general, however, replacing a catas¬ 
trophic cancellation by a benign one is 
not worthwhile if the expense is large 
because the input is often (but not al¬ 
ways) an approximation. But eliminat¬ 
ing a cancellation entirely (as in the 
quadratic formula) is worthwhile even if 
the data are not exact. Throughout this 
paper, it will be assumed that the float¬ 
ing-point inputs to an algorithm are ex¬ 
act and that the i-esnlts are cempvited as 

accurately as possible. 

The expression x 2 — y 2 is more accu¬ 
rate when rewritten as (x — y)(x + y) 
because a catastrophic cancellation is 


replaced with a benign one. We next pre¬ 
sent more interesting examples of formu¬ 
las exhibiting catastrophic cancellation 
that can be rewritten to exhibit only 
benign cancellation. 

The area of a triangle can be expressed 
directly in terms of the lengths of its 
sides a, b, and c as 

A = \/s(s — a)(s — b)(s — c) , 
a + b + c 

where s = -—-. (6) 

Suppose the triangle is very flat; that is, 
a = b + c. Then s * a, and the term 
(s — a) in eq. (6) subtracts two nearby 
numbers, one of which may have round¬ 
ing error. For example, if a = 9.0, b = c 
= 4.53, then the correct value of s is 
9.03 and A is 2.34. Even though the 
computed value of s (9.05) is in error by 
only 2 ulps, the computed value of A is 
3.04, an error of 60 ulps. 

There is a way to rewrite formula (6) 
so that it will return accurate results 
even for flat triangles [Kahan 1986], It is 

A = [(la + (6 + c))(c- (a - b)) 

x (c + (a - 6))(a + (6 - c))] /2 /4, 

a > b > c. (7) 

If a, b, and c do not satisfy a > b > c, 
simply rename them before applying (7). 
It is straightforward to check that the 
right-hand sides of (6) and (7) are alge¬ 
braically identical. Using the values of 
a, b, and c above gives a computed area 
of 2.35, which is 1 ulp in error and much 
more accurate than the first formula. 

Although formula (7) is much more 
accurate than (6) for this example, it 
would be nice to know how well (7) per¬ 
forms in general. 

Theorem 3 

The rounding error incurred when using 

(7) to compute the area of a triangle is at 

most lie, provided subtraction is per¬ 
formed with a guard digit, e < .005, and 
square roots are computed to within 1/2 
ulp. 
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The condition that e < .005 is met in 
virtually every actual floating-point sys¬ 
tem. For example, when j3 = 2, p > 8 
ensures that e < .005, and when (3 = 10, 
p > 3 is enough. 

In statements like Theorem 3 that dis¬ 
cuss the relative error of an expression, 
it is understood that the expression is 
computed using floating-point arith¬ 
metic. In particular, the relative error is 
actually of the expression 

(SQRT(a ®(b ®c)) ®(c0(a 0b)) 
0(c ®(a 0b)) 0 (a ®(b 0c))j 
0 4. (8) 

Because of the cumbersome nature of (8), 
in the statement of theorems we will 
usually say the computed value of E 
rather than writing out E with circle 
notation. 

Error bounds are usually too pes¬ 
simistic. In the numerical example given 
above, the computed value of (7) is 2.35, 
compared with a true value of 2.34216 
for a relative error of 0.7e, which is much 
less than lie. The main reason for com¬ 
puting error bounds is not to get precise 
bounds but rather to verify that the 
formula does not contain numerical 
problems. 

A final example of an expression that 
can be rewritten to use benign cancella¬ 
tion is (1 + x) n , where x < 1. This ex¬ 
pression arises in financial calculations. 
Consider depositing $100 every day into 
a bank account that earns an annual 
interest rate of 6%, compounded daily. If 
n = 365 and i = .06, the amount of 
money accumulated at the end of one 
year is 100[(1 + i / n) n — 1 ]/(i/n) dol¬ 
lars. If this is computed using /3 = 2 and 
p = 24, the result is $37615.45 compared 
to the exact answer of $37614.05, a 
discrepancy of $1.40. The reason for 
the problem is easy to see. The expres¬ 
sion 1 + i/n involves adding 1 to 
.0001643836, so the low order bits of i/n 
are lost. This rounding error is amplified 
when 1 + i/n is raised to the nth power. 


The troublesome expression (1 + i / ri) n 
can be rewritten as exp[ n ln(l + i / n)], 
where now the problem is to compute 
ln(l + x) for small x. One approach is to 
use the approximation ln(l + x) ® x, in 
which case the payment becomes 
$37617.26, which is off by $3.21 and even 
less accurate than the obvious formula. 
But there is a way to compute ln(l + x) 
accurately, as Theorem 4 shows 
[Hewlett-Packard 1982]. This formula 
yields $37614.07, accurate to within 2 
cents! 

Theorem 4 assumes that LN(x) ap¬ 
proximates ln(x) to within 1/2 ulp. The 
problem it solves is that when x is small, 
LN(1 © x) is not close to ln(l + x) be¬ 
cause 1 © x has lost the information in 
the low order bits of x. That is, the com¬ 
puted value of ln(l + x) is not close to its 
actual value when 1. 


Theorem 4 

If ln(l - x) is computed using the for¬ 
mula 


ln(l + x) 


x 


x ln(l + x) 
( 1 +*) - 1 


for 1 © x = 1 
for 1 © x ^ 1 


the relative error is at most 5e when 0 < 
x < 3/4, provided subtraction is per¬ 
formed with a guard digit, e < 0.1, and 
In is computed to within 1/2 ulp. 


This formula will work for any value of 
x but is only interesting for x 1, which 
is where catastrophic cancellation occurs 
in the naive formula Inti + x). Although 
the formula may seem mysterious, there 
is a simple explanation for why it works. 
Write ln(l + x) as x[ln(l + x)/x] = 
x/x(x). The left-hand factor can be com¬ 
puted exactly, but the right-hand factor 
g(x) = ln(l + x)/x will suffer a large 
rounding error when adding 1 to x. How¬ 
ever, p is almost constant, since ln(l + 
x) ~ x. So changing x slightly will not 
introduce much error. In other words, if 
x ~ x, computing xp(5c) will be a good 
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approximation to xp{x) = ln(l + x). Is 
there a value for x for which x and 
x + 1 can be computed accurately? There 
is; namely, x = (1 © x) © 1, because 
then 1 + x is exactly equal to 1 © x. 

The results of this section can be sum¬ 
marized by saying that a guard digit 
guarantees accuracy when nearby pre¬ 
cisely known quantities are subtracted 
(benign cancellation). Sometimes a for¬ 
mula that gives inaccurate results can be 
rewritten to have much higher numeri¬ 
cal accuracy by using benign cancella¬ 
tion; however, the procedure only works 
if subtraction is performed using a guard 
digit. The price of a guard digit is not 
high because is merely requires making 
the adder 1 bit wider. For a 54 bit double 
precision adder, the additional cost is less 
than 2%. For this price, you gain the 
ability to run many algorithms such as 
formula (6) for computing the area of a 
triangle and the expression in Theorem 4 
for computing ln(l + x). Although most 
modern computers have a guard digit, 
there are a few (such as Crays) that 
do not. 

1.5 Exactly Rounded Operations 

When floating-point operations are done 
with a guard digit, they are not as accu¬ 
rate as if they were computed exactly 
then rounded to the nearest floating-point 
number. Operations performed in this 
manner will be called exactly rounded. 
The example immediately preceding 
Theorem 2 shows that a single guard 
digit will not always give exactly rounded 
results. Section 1.4 gave several exam¬ 
ples of algorithms that require a guard 
digit in order to work properly. This sec¬ 
tion gives examples of algorithms that 
require exact rounding. 

So far, the definition of rounding has 
not been given. Rounding is straightfor¬ 
ward, with the exception of how to round 
halfway cases; for example, should 12.5 
round to 12 or 13? One school of thought 
divides the 10 digits in half, letting 
{0,1,2,3,4} round down and {5,6,7,8,9} 
round up; thus 12.5 would round to 13. 
This is how rounding works on Digital 


Equipment Corporation’s VAX 6 comput¬ 
ers. Another school of thought says that 
since numbers ending in 5 are halfway 
between two possible roundings, they 
should round down half the time and 
round up the other half. One way of ob¬ 
taining this 50% behavior is to require 
that the rounded result have its least 
significant digit be even. Thus 12.5 
rounds to 12 rather than 13 because 2 is 
even. Which of these methods is best, 
round up or round to even? Reiser and 
Knuth [1975] offer the following reason 
for preferring round to even. 

Theorem 5 

Let x and y be floating-point numbers, 
and define x 0 = x, x x = (x 0 © y) © 
y, - - -, x n = (x n _ x Gy) Gy. If © and 
© are exactly rounded using round to 
even, then either x n = xforallnorx n = x x 
for all n > 1. S® 

To clarify this result, consider (3 = 10, 
p = 3 and let x = 1.00, y = -.555. 
When rounding up, the sequence be¬ 
comes x 0 © y = 1.56, x x = 1.56 © .555 
= 1.01, x : © y = 1.01 © .555 = 1.57, 
and each successive value of x n in¬ 
creases by .01. Under round to even, x n 
is always 1.00. This example suggests 
that when using the round up rule, com¬ 
putations can gradually drift upward, 
whereas when using round to even the 
theorem says this cannot happen. 
Throughout the rest of this paper, round 
to even will be used. 

One application of exact rounding oc¬ 
curs in multiple precision arithmetic. 
There are two basic approaches to higher 
precision. One approach represents float¬ 
ing-point numbers using a very large sig- 
nificand, which is stored in an array of 
words, and codes the routines for manip¬ 
ulating these numbers in assembly lan¬ 
guage. The second approach represents 
higher precision floating-point numbers 
as an array of ordinary floating-point 


6 VAX is a trademark of Digital Equipment 
Corporation. 
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numbers, where adding the elements of 
the array in infinite precision recovers 
the high precision floating-point number. 
It is this second approach that will be 
discussed here. The advantage of using 
an array of floating-point numbers is that 
it can be coded portably in a high-level 
language, but it requires exactly rounded 
arithmetic. 

The key to multiplication in this sys¬ 
tem is representing a product xy as a 
sum, where each summand has the same 
precision as x and y. This can be done 
by splitting x and y. Writing x = x h + x t 
and y = y h + y t , the exact product is xy 
= x h y h + x h yi + x l y h + x t y v If x and y 
have p bit significands, the summands 
will also have p bit significands, pro¬ 
vided x u x h , y h , y t can be represented 
using [ p/ 2J bits. When p is even, it is 
easy to find a splitting. The number 
x 0 .x 1 • • • x p _ 1 can be written as the sum 
of x 0 .x 1 ■■■ x p/2 _ 1 and 0.0 ■ • • 0x p , 2 
x p _ 1 . When p is odd, this simple 
splitting method will not work. An extra 
bit can, however, be gained by using neg¬ 
ative numbers. For example, if 0 = 2, 
p = 5, and x = .10111, x can be split as 
x h = .11 and X[ = - .00001. There is 
more than one way to split a number. A 
splitting method that is easy to compute 
is due to Dekker [1971], but it requires 
more than a single guard digit. 

Theorem 6 

Let p be the floating-point precision, with 
the restriction that p is even when 0 > 2, 
and assume that floating-point operations 
are exactly rounded. Then ifk= \ p/2] is 
half the precision (rounded up) and m = 
0 k + 1, x can be split as x = x h + x { , 
where x h = (m x) © (m ® x © x), x t 
= x © x h , and each x x is representable 
using [ p/2\ bits of precision. 

To see how this theorem works in an 
example, let 0 = 10, p = 4, b = 3.476, 
a = 3.463, and c = 3.479. Then b 2 - ac 
rounded to the nearest floating-point 
number is .03480, while b® b = 12.08, 
a <8> c = 12.05, and so the computed value 
of 6 2 — ac is .03. This is an error of 480 


ulps. Using Theorem 6 to write b = 3.5 
- .024, a = 3.5 — .037, and c = 3.5 — 
.021, b 2 becomes 3.5 2 - 2 X 3.5 x .024 
+ .024 2 . Each summand is exact, so b 2 
= 12.25 — .168 + .000576, where the 
sum is left unevaluated at this point. 
Similarly, 

ac = 3.5 2 - (3.5 x .037 + 3.5 x .021) 

+ .037 x .021 

= 12.25 - .2030 + .000777. 

Finally, subtracting these two series term 
by term gives an estimate for b 2 - ac of 
0 © .0350 © .04685 = .03480, which is 
identical to the exactly rounded result. 
To show that Theorem 6 really requires 
exact rounding, consider p = 3, 0 = 2, 
and x = 7. Then m = 5, mx = 35, and 
m <S> x = 32. If subtraction is performed 
with a single guard digit, then (m<S>x) 
© x = 28. Therefore, x h = 4 and x l = 3, 
hence x l not representable with [p/ 2J = 
1 bit. 

As a final example of exact rounding, 
consider dividing m by 10. The result is 
a floating-point number that will in gen¬ 
eral not be equal to m/10. When 0 = 2, 
however, multiplying m 010 by 10 will 
miraculously restore m, provided exact 
rounding is being used. Actually, a more 
general fact (due to Kahan) is true. The 
proof is ingenious, but readers not inter¬ 
ested in such details can skip ahead to 
Section 2. 

Theorem 7 

When 0 = 2, if m and n are integers with 
I m | < 2 P_1 and n has the special form 
n = 2' + 2 J , then (m 0 n) <S> n - m, 
provided floating-point operations are 
exactly rounded. 

Proof. Scaling by a power of 2 is 
harmless, since it changes only the expo¬ 
nent not the significand. If q = m/n, 
then scale n so that 2 P_1 < n < 2 P and 
scale m so that 1/2 < q < 1. Thus, 2 P ~ 2 
< m < 2 P . Since m has p significant 
bits, it has at most 1 bit to the right of 
the binary point. Changing the sign of m 
is harmless, so assume q > 0. 
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If q = m 0 n, to prove the theorem 
requires showing that 

I nq - m | < - . (9) 

That is because m has at most 1 bit right 
of the binary point, so nq will round to 
m. To deal with the halfway case when 
| nq - m | =1/4, note that since the ini¬ 
tial unsealed m had |m| <2 P ~ 1 , its 
low-order bit was 0, so the low-order bit 
of the scaled m is also 0. Thus, halfway 
cases will round to m. 

Suppose q = . q x q 2 ‘'' , and let q = 
(hQi •" <7 p l- To estimate \nq-m\, 
first compute \q - q\ = | N/2 P+1 - 
m/n\, where N is an odd integer. 
Since n = 2 l + 2 J and 2 p - 1 <n<2 p , 
it must be that n = 2 P 1 + 2 k for some 
k < p - 2, and thus 

I nN - 2 p + 1 m I 


(2 P- l ~ k + 1 )N - 2 p+1 - k m 
n2 p+1 ~ k ' 

The numerator is an integer, and since 
N is odd, it is in fact an odd integer. 
Thus, | q - q | > l/(n2 p+1- *). Assume 
q < q (the case q > q is similar). Then 
nq < m, and 

| m - nq\ = m - nq = n(q - q) 

= n(q- (q-2- p ~ 1 )) 


= ( 2 p - 1 + 2*)2- p - 1 -I- 2 -p- x +*= i. 

This establishes (9) and proves the theo¬ 
rem. ■ 

The theorem holds true for any base 13, 
as long as 2 1 + 2 J is replaced by (3 l + j3 J . 
As B gets larger, however, there are 
fewer and fewer denominators of the 
form B l + P J . 

We are now in a position to answer the 
question, Does it matter if the basic 


arithmetic operations introduce a little 
more rounding error than necessary? The 
answer is that it does matter, because 
accurate basic operations enable us to 
prove that formulas are “correct” in the 
sense they have a small relative error. 
Section 1.4 discussed several algorithms 
that require guard digits to produce cor¬ 
rect results in this sense. If the input to 
those formulas are numbers representing 
imprecise measurements, however, the 
bounds of Theorems 3 and 4 become less 
interesting. The reason is that the be¬ 
nign cancellation x — y can become 
catastrophic if x and y are only approxi¬ 
mations to some measured quantity. But 
accurate operations are useful even in 
the face of inexact data, because they 
enable us to establish exact relationships 
like those discussed in Theorems 6 and 7. 
These are useful even if every floating¬ 
point variable is only an approximation 
to some actual value. 

2. IEEE STANDARD 

There are two different IEEE standards 
for floating-point computation. IEEE 754 
is a binary standard that requires B = 2, 
p ■= 24 for single precision and p = 53 
for double precision [IEEE 1987]. It also 
specifies the precise layout of bits in a 
single and double precision. IEEE 854 
allows either (3 = 2 or p = 10 and unlike 
754, does not specify how floating-point 
numbers are encoded into bits [Cody et 
al. 1984], It does not require a particular 
value for p, but instead it specifies con¬ 
straints on the allowable values of p for 
single and double precision. The term 
IEEE Standard will be used when dis¬ 
cussing properties common to both 
standards. 

This section provides a tour of the IEEE 
standard. Each subsection discusses one 
aspect of the standard and why it was 
included. It is not the purpose of this 
paper to argue that the IEEE standard is 
the best possible floating-point standard 
but rather to accept the standard as given 
and provide an introduction to its use. 
For full details consult the standards 
[Cody et al. 1984; Cody 1988; IEEE 1987]. 
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2.1 Formats and Operations 

2.1.1 Base 

It is clear why IEEE 854 allows /3 = 10. 
Base 10 is how humans exchange and 
think about numbers. Using /3 = 10 is 
especially appropriate for calculators, 
where the result of each operation is dis¬ 
played by the calculator in decimal. 

There are several reasons wky IEEE 
854 requires that if the base is not 10, it 
must be 2. Section 1.2 mentioned one 
reason: The results of error analyses are 
much tighter when (3 is 2 because a 
rounding error of 1/2 ulp wobbles by a 
factor of j3 when computed as a relative 
error, and error analyses are almost al¬ 
ways simpler when based on relative er¬ 
ror. A related reason has to do with the 
effective precision for large bases. Con¬ 
sider /3 = 16, p — 1 compared to <3 = 2, 
p = 4. Both systems have 4 bits of signif- 
icand. Consider the computation of 15/8. 
When 5 = 2, 15 is represented as 1.111 
x 2 3 and 15/8 as 1.111 x 2°. So 15/8 is 
exact. When f3 = 16, however, 15 is rep¬ 
resented as F X 16°, where F is the hex¬ 
adecimal digit for 15. But 15/8 is repre¬ 
sented as 1 x 16°, which has only 1 bit 
correct. In general, base 16 can lose up to 
3 bits, so a precision of p can have an 
effective precision as low as 4p - 3 
rather than 4 p. 

Since large values of /3 have these 
problems, why did IBM choose (3 = 16 for 
its system/370? Only IBM knows for sure, 
but there are two possible reasons. The 
first is increased exponent range. Single 
precision on the system/370 has /3 = 16, 
p = 6. Hence the significand requires 24 
bits. Since this must fit into 32 bits, this 

leaves 7 bits for the exponent and 1 for 
the sign bit. Thus, the magnitude of rep¬ 
resentable numbers ranges from about 

16“ 2 to about 16 2 ’ = 2 28 . To get a simi¬ 
lar exponent range when /3 = 2 would 
require 9 bits of exponent, leaving only 
22 bits for the significand. It was just 
pointed out, however, that when /? = 16, 
the effective precision can be as low as 
4p - 3 = 21 bits. Even worse, when j8 = 
2 it is possible to gain an extra bit of 


precision (as explained later in this sec¬ 
tion), so the /3 = 2 machine has 23 bits of 
precision to compare with a range of 
21-24 bits for the /? = 16 machine. 

Another possible explanation for 
choosing = 16 bits has to do with shift¬ 
ing. When adding two floating-point 
numbers, if their exponents are different, 
one of the significands will have to be 
shifted to make the radix points line up, 
slowing down the operation. In the /3 = 
16, p — 1 system, all the numbers be¬ 
tween 1 and 15 have the same exponent, 
so no shifting is required when adding 

any of the |J = 105 possible pairs of 

distinct numbers from this set. In the 
j3 = 2, p = 4 system, however, these 
numbers have exponents ranging from 0 
to 3, and shifting is required for 70 of the 
105 pairs. 

In most modern hardware, the perform¬ 
ance gained by avoiding a shift for a 
subset of operands is negligible, so the 
small wobble of /3 = 2 makes it the 
preferable base. Another advantage of 
using /3 = 2 is that there is a way to gain 
an extra bit of significance. 7 Since float¬ 
ing-point numbers are always normal¬ 
ized, the most significant bit of the 
significand is always 1, and there is no 
reason to waste a bit of storage repre¬ 
senting it. Formats that use this trick 
are said to have a hidden bit. It was 
pointed out in Section 1.1 that this re¬ 
quires a special convention for 0. The 
method given there was that an expo¬ 
nent of e mm - 1 and a significand of all 
zeros represent not 1.0 X 2 e “ m ' 1 but 
rather 0. 

IEEE 754 single precision is encoded 
in 32 bits using 1 bit for the sign, 8 bits 
for the exponent, and 23 bits for the sig¬ 
nificand. It uses a hidden bit, however, 
so the significand is 24 bits (p = 24), 
even though it is encoded using only 
23 bits. 


' This appears to have first been published by Gold¬ 
berg [1967], although Knuth [1981 page 211] at¬ 
tributes this idea to Konrad Zuse. 
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2 . 7.2 Precision 

The IEEE standard defines four different 
precisions: single, double, single ex¬ 
tended, and double extended. In 754, sin¬ 
gle and double precision correspond 
roughly to what most floating-point 
hardware provides. Single precision oc¬ 
cupies a single 32 bit word, double preci¬ 
sion two consecutive 32 bit words. 
Extended precision is a format that offers 
just a little extra precision and exponent 
range (Table 1). The IEEE standard only 
specifies a lower bound on how many 
extra bits extended precision provides. 
The minimum allowable double-extended 
format is sometimes referred to as 80 -bit 
format, even though the table shows it 
using 79 bits. The reason is that hard¬ 
ware implementations of extended preci¬ 
sion normally do not use a hidden bit and 
so would use 80 rather than 79 bits. 8 

The standard puts the most emphasis 
on extended precision, making no recom¬ 
mendation concerning double precision 
but strongly recommending that 

Implementations should support the extended 

format corresponding to the widest basic format 

supported,. . . 

One motivation for extended precision 
comes from calculators, which will often 
display 10 digits but use 13 digits inter¬ 
nally. By displaying only 10 of the 13 
digits, the calculator appears to the user 
f : a black box that computes exponen¬ 
tials, cosines, and so on, to 10 digits of 
accuracy. For the calculator to compute 
functions like exp, log, and cos to within 
10 digits with reasonable efficiency, how¬ 
ever, it needs a few extra digits with 
which to work. It is not hard to find a 
simple rational expression that approxi¬ 
mates log with an error of 500 units in 
the last place. Thus, computing with 13 
digits gives an answer correct to 10 dig¬ 
its. By keeping these extra 3 digits hid- 


a According to Kahan, extended precision has 64 
bits of significancl because that was the widest 
precision across which carry propagation could be 
done on the Intel 8087 without increasing the cycle 
time [Kahan 1988]. 


den, the calculator presents a simple 
model to the operator. 

Extended precision in the IEEE stand¬ 
ard serves a similar function. It enables 
libraries to compute quantities to within 
about 1/2 ulp in single (or double) preci¬ 
sion efficiently, giving the user of those 
libraries a simple model, namely, that 
each primitive operation, be it a simple 
multiply or an invocation of log, returns 
a value accurate to within about 1/2 ulp. 
When using extended precision, however, 
it is important to make sure that its use 
is transparent to the user. For example, 
on a calculator, if the internal represen¬ 
tation of a displayed value is not rounded 
to the same precision as the display, the 
result of further operations will depend 
on the hidden digits and appear unpre¬ 
dictable to the user. 

To illustrate extended precision fur¬ 
ther, consider the problem of converting 
between IEEE 754 single precision and 
decimal. Ideally, single precision num¬ 
bers will be printed with enough digits so 
that when the decimal number is read 
back in, the single precision number can 
be recovered. It turns out that 9 decimal 
digits are enough to recover a single pre¬ 
cision binary number (see Section 4.2). 
When converting a decimal number back 
to its unique binary representation, a 
rounding error as small as 1 ulp is fatal 
because it will give the wrong answer. 
Here is a situation where extended preci¬ 
sion is vital for an efficient algorithm. 
When single extended is available, a 
straightforward method exists for con¬ 
verting a decimal number to a single 
precision binary one. First, read in the 9 
decimal digits as an integer N, ignoring 
the decimal point. From Table 1, p > 32, 
and since 10 9 < 2 32 « 4.3 X 10 9 , N can 
be represented exactly in single ex¬ 
tended. Next, find the appropriate power 
10 p necessary to scale N. This will be a 
combination of the exponent of the deci¬ 
mal number, and the position of the 
(up until now) ignored decimal point. 
Compute 10 1 p| . If | P | <13, this is also 
represented exactly, because 10 13 = 
2 13 5 13 and 5 13 < 2 32 . Finally, multiply 
(or divide if P < 0) N and 10 1 p '. If this 
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Table 1. IEEE 754 Format Parameters 


Parameter 


Format 


Single 

Single Extended 

Double 

Double Extended 

P 

24 

> 32 

53 

> 64 

P 

+ 127 

> +1023 

+ 1023 

> +16383 

P 

-126 

< -1022 

-1022 

< -16382 

Exponent width in bits 

8 

> 11 

11 

> 15 

Format width in bits 

32 

> 43 

64 

> 79 


last operation is done exactly, the closest 
binary number is recovered. Section 4.2 
shows how to do the last multiply (or 
divide) exactly. Thus, for | P | < 13, the 
use of the single-extended format enables 
9 digit decimal numbers to be converted 
to the closest binary number (i.e., ex¬ 
actly rounded). If j P | > 13, single- 
extended is not enough for the above 
algorithm to compute the exactly rounded 
binary equivalent always, but Coonen 
[1984] shows that it is enough to guaran¬ 
tee that the conversion of binary to deci¬ 
mal and back will recover the original 
binary number. 

If double precision is supported, the 
algorithm above would run in double 
precision rather than single-extended, 
but to convert double precision to a 17 
digit decimal number and back would 
require the double-extended format. 

2.1.3 Exponent 

Since the exponent can be positive or 
negative, some method must be chosen to 
represent its sign. Two common methods 
of representing signed numbers are 
sign/magnitude and two’s complement. 
Sign/magnitude is the system used for 
the sign of the significand in the IEEE 
formats: 1 bit is used to hold the sign; the 
rest of the bits represent the magnitude 
of the number. The two’s complement 
representation is often used in integer 
arithmetic. In this scheme, a number 
is represented by the smallest nonneg¬ 
ative number that is congruent to it 
modulo 2 p . 

The IEEE binary standard does not 
use either of these methods to represent 
the exponent but instead uses a biased 


representation. In the case of single pre¬ 
cision, where the exponent is stored in 8 
bits, the bias is 127 (for double precision 
it is 1023). What this means is that if k 
is the value of the exponent bits inter¬ 
preted as an unsigned integer, then the 
exponent of the floating-point number is 
k - 127. This is often called the biased 
exponent to distinguish from the unbi¬ 
ased exponent k. An advantage of biased 
representation is that nonnegative flout - 
ing-point numbers can be treated as 
integers for comparison purposes. 

Referring to Table 1, single precision 
has e max = 127 and e min = -126. The 
reason for having | e mm | < e max is so that 
the reciprocal of the smallest number 
(l/ 2 e mm) will not overflow. Although it is 
true that the reciprocal of the largest 
number will underflow, underflow is usu¬ 
ally less serious than overflow. Section 
2.1.1 explained that e mm - 1 is used for 
representing 0, and Section 2.2 will in¬ 
troduce a use for e raax + 1. In IEEE sin¬ 
gle precision, this means that the biased 
exponents range between e mm - 1 = 
—127 and e max + 1 = 128 whereas the 
unbiased exponents range between 0 
and 255, which are exactly the nonneg¬ 
ative numbers that can be represented 
using 8 bits. 

2.1.4 Operations 

The IEEE standard requires that the re¬ 
sult of addition, subtraction, multiplica¬ 
tion, and division be exactly rounded. 
That is, the result must be computed 
exactly then rounded to the nearest float¬ 
ing-point number (using round to even). 
Section 1.3 pointed out that computing 
the exact difference or sum of two float- 
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ing-point numbers can be very expensive 
when their exponents are substantially 
different. That section introduced guard 
digits, which provide a practical way of 
computing differences while guarantee¬ 
ing that the relative error is small. Com¬ 
puting with a single guard digit, 
however, will not always give the same 
answer as computing the exact result 
then rounding. By introducing a second 
guard digit and a third sticky bit, differ¬ 
ences can be computed at only a little 
more cost than with a single guard digit, 
but the result is the same as if the differ¬ 
ence were computed exactly then rounded 
[Goldberg 1990]. Thus, the standard can 
be implemented efficiently. 

One reason for completely specifying 
the results of arithmetic operations is to 
improve the portability of software. When 
a program is moved between two ma¬ 
chines and both support IEEE arith¬ 
metic, if any intermediate result differs, 
it must be because of software bugs not 
differences in arithmetic. Another ad¬ 
vantage of precise specification is that it 
makes it easier to reason about floating 
point. Proofs about floating point are 
hard enough without having to deal with 
multiple cases arising from multiple 
kinds of arithmetic. Just as integer pro¬ 
grams can be proven to be correct, so can 
floating-point programs, although what 
is proven in that case is that the round¬ 
ing error of the result satisfies certain 
bounds. Theorem 4 is an example of such 
a proof. These proofs are made much eas¬ 
ier when the operations being reasoned 
about are precisely specified. Once an 
algorithm is proven to be correct for IEEE 
arithmetic, it will work correctly on any 
machine supporting the IEEE standard. 

Brown [1981] has proposed axioms for 
floating point that include most of the 
existing floating-point hardware. Proofs 
in this system cannot, however, verify 
the algorithms of Sections 1.4 and 1.5, 
which require features not present on all 
hardware. Furthermore, Brown’s axioms 
are more complex than simply defining 
operations to be performed exactly then 
rounded. Thus, proving theorems from 
Brown’s axioms is usually more difficult 


than proving them assuming operations 
are exactly rounded. 

There is not complete agreement on 
what operations a floating-point stand¬ 
ard should cover. In addition to the basic 
operations +, -, X, and /, the IEEE 
standard also specifies that square root, 
remainder, and conversion between inte¬ 
ger and floating point be correctly 
rounded. It also requires that conversion 
between internal formats and decimal be 
correctly rounded (except for very large 
numbers). Kulisch and Miranker [1986] 
have proposed adding inner product to 
the list of operations that are precisely 
specified. They note that when inner 
products are computed in IEEE arith¬ 
metic, the final answer can be quite 
wrong. For example, sums are a special 
case of inner products, and the sum ((2 x 
10“ 30 + 10 3 °) - 10“ 3 °) - 10 3 ° is exactly 
equal to 10“ 30 , but on a machine with 
IEEE arithmetic the computed result 
will be -10” 30 . It is possible to compute 
inner products to within 1 ulp with less 
hardware than it takes to imple¬ 
ment a fast multiplier [Kirchner and 
Kulisch 1987], 9 

All the operations mentioned in the 
standard, except conversion between dec¬ 
imal and binary, are required to be 
exactly rounded. The reason is that effi¬ 
cient algorithms for exactly rounding all 
the operations, except conversion, are 
known. For conversion, the best known 
efficient algorithms produce results that 
are slightly worse than exactly rounded 
ones [Coonen 1984], 

The IEEE standard does not require 
transcendental functions to be exactly 
rounded because of the table maker’s 
dilemma. To illustrate, suppose you are 
making a table of the exponential func¬ 
tion to four places. Then exp(1.626) = 
5.0835. Should this be rounded to 5.083 
or 5.084? If exp(1.626) is computed more 
carefully, it becomes 5.08350, then 


9 Some arguments against including inner product 
as one of the basic operations are presented by 
Kahan and LeBlanc [1985]. 
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5.083500, then 5.0835000. Since exp is 
transcendental, this could go on arbitrar¬ 
ily long before distinguishing whether 
exp(1.626) is 5.083500 ••• 0 ddd or 
5.0834999 • • • 9 ddd. Thus, it is not prac¬ 
tical to specify that the precision of tran¬ 
scendental functions be the same as if 
the functions were computed to infinite 
precision then rounded. Another ap¬ 
proach would be to specify transcenden¬ 
tal functions algorithmically. But there 
does not appear to be a single algorithm 
that works well across all hardware ar¬ 
chitectures. Rational approximation, 
CORDIC, 10 and large tables are three 
different techniques used for computing 
transcendentals on contemporary ma¬ 
chines. Each is appropriate for a differ¬ 
ent class of hardware, and at present no 
single algorithm works acceptably over 
the wide range of current hardware. 

2.2 Special Quantities 

On some floating-point hardware every 
bit pattern represents a valid floating¬ 
point number. The IBM System/370 is 
an example of this. On the other hand, 
the VAX reserves some bit patterns to 
represent special numbers called re¬ 
served operands. This idea goes back to 
the CDC 6600, which had bit patterns for 
the special quantities INDEFINITE and 
INFINITY. 

The IEEE standard continues in this 
tradition and has NaNs (Not a Number, 
pronounced to rhyme with plan) and in¬ 
finities. Without special quantities, there 
is no good way to handle exceptional sit¬ 
uations like taking the square root of a 
negative number other than aborting 
computation. Under IBM System/370 
FORTRAN, the default action in re¬ 
sponse to computing the square root of a 
negative number like —4 results in the 
printing of an error message. Since every 


10 CORDIC is an acronym for Coordinate Rotation 
Digital Computer and is a method of computing 
transcendental functions that uses mostly shifts 
and adds (i.e., very few multiplications and divi¬ 
sions) [Walther 1971], It is the method used on both 
the Intel 8087 and the Motorola 68881. 


Table 2. IEEE 754 Special Values 


Exponent 

Fraction 

Represents 

« = e mm - 1 

f= o 

±0 

e = e mm - 1 

f* 0 

O.fx 2 < ’ mm 

(> *£. p < p 

^min — — max 

— 

1 fx 2° 

^ — ^max ^ 

/■= 0 

± CO 

e ~ e max + 1 

f* 0 

NaN 


bit pattern represents a valid num¬ 
ber, the return value of square root 
must be some floating-point number. 
In the case of System/370 FORTRAN, 
— 4 | = 2 is returned. In IEEE arith¬ 
metic, an NaN is returned in this 
situation. 

The IEEE standard specifies the fol¬ 
lowing special values (see Table 2): ±0, 
denormalized numbers, ± oo and NaNs 
(there is more than one NaN, as ex¬ 
plained in the next section). These 
special values are all encoded with 
exponents of either e max + 1 or e mm - 1 
(it was already pointed out that 0 has an 
exponent of e mm - 1). 

2.2.1 NaNs 

Tr adit ionally, the computation of 0/0 or 
V — 1 has been treated as an unrecover¬ 
able error that causes a computation to 
halt. There are, however, examples for 
which it makes sense for a computation 
to continue in such a situation. Consider 
a subroutine that finds the zeros of a 
function f, say zero(f). Traditionally, 
zero finders require the user to input an 
interval [a, b] on which the function is 
defined and over which the zero finder 
will search. That is, the subroutine is 
called as zero(f, a, b). A more useful zero 
finder would not require the user to in¬ 
put this extra information. This more 
general zero finder is especially appropri¬ 
ate for calculators, where it is natural to 
key in a function and awkward to then 
have to specify the domain. It is easy, 
however, to see why most zero finders 
require a domain. The zero finder does 
its work by probing the function f at 
various values. If it probed for a value 
outside the domain of f, the code for f 
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Table 3. Operations that Produce an NaN 


Operation 

NaN Produced by 

+ 

00 + (- oo) 

X 

0 X oo 

/ 

0/0, oo/oo 

REM 

x REM 0, oo REM y 

V''' 

•Jx (when x < 0) 


might well compute 0/0 or V — 1, and 
the computation would halt, unnecessar¬ 
ily aborting the zero finding process. 

This problem can be avoided by intro¬ 
ducing a special value called NaN and 
specifying that the computation of ex¬ 
pressions like 0/0 and V — 1 produce 
NaN rather than halting. (A list of some 
of the situations that can cause a NaN is 
given in Table 3.) Then, when zero(f) 
probes outside the domain of f, the code 
for f will return NaN and the zero finder 
can continue. That is, zero(f) is not 
“punished” for making an incorrect 
guess. With this example in mind, it is 
easy to see what the result of combining 
a NaN with an ordinary floating-point 
number should be. Suppose the final 
statement of f is return! - b +sqrt(d))/ 
(2 * a). If d < 0, then f should return a 
NaN. Since d < 0, sqrt(d) is an NaN, 
and -b + sqrt(d) will be a NaN if the 
sum of an NaN and any other number 
is a NaN. Similarly, if one operand 
of a division operation is an NaN, 
the quotient should be a NaN. In 
general, whenever a NaN participates 
in a floating-point operation, the 
result is another NaN. 

Another approach to writing a zero 
solver that does not require the user to 
input a domain is to use signals. The 
zero finder could install a signal handler 
for floating-point exceptions. Then if f 
were evaluated outside its domain and 
raised an exception, control would be re¬ 
turned to the zero solver. The problem 
with this approach is that every lan¬ 
guage has a different method of handling 
signals (if it has a method at all), and so 
it has no hope of portability. 

In IEEE 754, NaNs are represented as 
floating-point numbers with the expo¬ 


nent e max + 1 and nonzero significands. 
Implementations are free to put system- 
dependent information into the signifi- 
cand. Thus, there is not a unique NaN 
but rather a whole family of NaNs. When 
an NaN and an ordinary floating-point 
number are combined, the result should 
be the same as the NaN operand. Thus, 
if the result of a long computation is an 
NaN, the system-dependent information 
in the significand will be the information 
generated when the first NaN in the 
computation was generated. Actually, 
there is a caveat to the last statement. If 
both operands are NaNs, the result will 
be one of those NaNs but it might not be 
the NaN that was generated first. 

2.2.2 Infinity 

Just as NaNs provide a way to continue 
a computation when expressions like 0/0 
or V — 1 are encountered, infinities pro¬ 
vide a way to continue when an overflow 
occurs. This is much safer than simply 
returning to the largest representable 
number. As an e xample, consider com¬ 
puting \J x 2 + y 2 , when /3 = 10, p = 3, 
and e max = 98. If x = 3 x 10 70 and 
y = 4 x 10 70 , then x 2 will overflow and 
be replaced by 9.99 X 10 98 . Similarly y 2 
and x 2 + y 2 will each overflow in turn 
and be replaced by 9.99 X 10 98 . So the 
final result will be (9.99 X 10 98 ) 1/2 = 
3.16 X 10 49 , which is drastically wrong. 
The correct answer is 5 x 10 7 °. In IEEE 
arithmetic, the result of % 2 is oo, as is 
y 2 , x 2 + y 2 , and \Jx 2 + y 2 . So the final 
result is oo, which is safer than 
returning an ordinary floating-point 
number that is nowhere near the correct 
answer. 11 

The division of 0 by 0 results in an 
NaN. A nonzero number divided by 0, 
however, returns infinity: 1/0 = oo, 
- 1/0 = — oo. The reason for the distinc¬ 
tion is this: If f(x) 0 and g(x) 0 as 


11 Fine point: Although the default in IEEE arith¬ 
metic is to round overflowed numbers to oo, it is 
possible to change the default (see Section 2.3.2). 
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x approaches some limit, then f(x)/g(x) 
could have any value. For example, 
when f(x) = sin x and g(x) = x, then 
f(x)/g(x) -> 1 as jc-» 0. But when f(x) 
= 1 — cos x, f(x)/g(x) 0. When 

thinking of 0/0 as the limiting situation 
of a quotient of two very small numbers, 
0/0 could represent anything. Thus, in 
the IEEE standard, 0/0 results in an 
NaN. But when c > 0 and f(x) -*■ c, g(x) 
—<■ 0, then f{x)/g(x) -> ±oo for any ana¬ 
lytic functions f and g. If g(x) < 0 for 
small x, then f(x)/g(x) -> -oo; other¬ 
wise the limit is +oo. So the IEEE stan¬ 
dard defines c/0 = ± oo as long as c 4= 0. 
The sign of oo depends on the signs of c 
and 0 in the usual way, so —10/0 = — oo 
and -10/-0= +oo. You can distin¬ 
guish between getting oo because of over¬ 
flow and getting oo because of division by 
0 by checking the status flags (which will 
be discussed in detail in Section 2.3.3). 
The overflow flag will be set in the first 
case, the division by 0 flag in the second. 

The rule for determining the result of 
an operation that has infinity as an 
operand is simple: Replace infinity with 
a finite number x and take the limit as 
x -*■ oo . Thus, 3/oo = 0, because 
lim^^S/x = 0. Similarly 4 - oo = - oo 
and 'fo o’ = oo. When the limit does not 
exist, the result is an NaN, so oo/oo will 
be an NaN (Table 3 has additional exam¬ 
ples). This agrees with the reasoning used 
to conclude that 0/0 should be an NaN. 

When a subexpression evaluates to a 
NaN, the value of the entire expression 
is also a NaN. In the case of ± oo, how¬ 
ever, the value of the expression might 
be an ordinary floating-point number be¬ 
cause of rules like l/oo = 0. Here is a 
practical example that makes use of the 
rules for infinity arithmetic. Consider 
computing the function x/(x 2 + 1). This 
is a bad formula, because not only will it 
overflow when x is larger than 
v /3 (fo™* /2 , but infinity arithmetic will 
give the wrong answer because it will 
yield 0 rather than a number near 1/x. 
However, x/{x 2 -I- 1) can be rewritten as 
l/(x + x^ 1 ). This improved expression 
will not overflow prematurely and be¬ 
cause of infinity arithmetic will have the 


correct value when x = 0: 1/(0 + 0 -1 ) = 
1/(0 + oo) = l/oo = 0. Without infinity 
arithmetic, the expression l/(x + x _1 ) 
requires a test for x = 0, which not only 
adds extra instructions but may also dis¬ 
rupt a pipeline. This example illustrates 
a general fact; namely, that infinity 
arithmetic often avoids the need for spe¬ 
cial case checking; however, formulas 
need to be carefully inspected to make 
sure they do not have spurious behavior 
at infinity [as x/(x 2 + 1) did]. 

2.2.3 Signed Zero 

Zero is represented by the exponent 
e mm — 1 and a zero significand. Since the 
sign bit can take on two different values, 
there are two zeros, +0 and -0. If a 
distinction were made when comparing 
+ 0 and -0, simple tests like if (x = 0) 
would have unpredictable behavior, de¬ 
pending on the sign of x. Thus, the IEEE 
standard defines comparison so that 
+ 0 = -0 rather than -0 < +0. Al¬ 
though it would be possible always to 
ignore the sign of zero, the IEEE stan¬ 
dard does not do so. When a multiplica¬ 
tion or division involves a signed zero, 
the usual sign rules apply in computing 
the sign of the answer. Thus, 3(4-0) = +0 
and +0/— 3 = -0. If zero did not have a 
sign, the relation 1/(1/ jc) = x would fail 
to hold when x= ±oo. The reason is 
that 1/— oo and 1 / 4 - oo both result in 0, 
and 1/0 results in 4-°°, the sign informa¬ 
tion having been lost. One way to restore 
the identity 1/(1/x) = x is to have 
only one kind of infinity; however, 
that would result in the disastrous 
consequence of losing the sign of an 
overflowed quantity. 

Another example of the use of signed 
zero concerns underflow and functions 
that have a discontinuity at zero such as 
log. In IEEE arithmetic, it is natural to 
define log 0 = — oo and log x to be an 
NaN when x < 0. Suppose x represents 
a small negative number that has under¬ 
flowed to zero. Thanks to signed zero, x 
will be negative so log can return an 
NaN. If there were no signed zero, 
however, the log function could not 


ACM Computing Surveys, Vol. 23, No. 1, March 1991 



Floating-Point Arithmetic 


23 


distinguish an underflowed negative 
number from 0 and would therefore have 
to return - oo. Another example of a 
function with a discontinuity at zero is 
the signum function, which returns the 
sign of a number. 

Probably the most interesting use of 
signed zero occurs in complex arithmetic. 
As an example, consider the equation 
1 /I /2 = 1 f \fz . This is certainly true 
when 2 > 0. If 2 = — 1, the obvio us com¬ 
putation gives /l/- 1 = V— 1 = i and 
1/V — 1 = 1/i = —i. Thus, \Jl/ z 
1 / \[z ! The problem can be traced to the 
fact that square root is multivalued, and 
there is no way to select the values so 
they are continuous in the entire com¬ 
plex plane. Square root is continuous, 
however, if a branch cut consisting of all 
negative real numbers is excluded from 
consideration. This leaves the problem of 
what to do for the negative real numbers, 
which are of the form — x + iO, where 
x > 0. Signed zero provides a perfect way 
to resolve this problem. Numbers of the 
form -x + i( + 0) have a square root of 
i'/x, and numbers of the form — x + 
i( - 0) on the other side of the branch cut 
have a square root with the other sign 
(-iVx). In fact, the natural formulas for 
computing V will give these results. 

Let us return to \J\j 2 =1/ \fz . If 2 = 
-1 = - 1 + iO, then 

I /2 = l/(— 1 + iO) 

1(-1 - iO) 

" (-1 + i0)( -1 - iO) 

= (-l-i0)/((-l) 2 -0 2 ) 

= — 1 + i( —0), 

so \/l/z = \/- 1 + i(-0) = - i, while 
l// 2 = 1/i = -i. Thus, IEEE arith¬ 
metic preserves this identity for all 2 . 
Some more sophisticated examples are 
given by Kahan [1987], Although distin¬ 
guishing between +0 and —0 has advan¬ 
tages, it can occasionally be confusing. 
For example, signed zero destroys the 
relation x = y & 1/ x = 1 / y, which is 
false when x = +0 and y = — 0. The 


IEEE committee decided, however, that 
the advantages of using signed zero out¬ 
weighed the disadvantages. 

2 . 2 .4 Denormalized Numbers 

Consider normalized floating-point num¬ 
bers with $ = 10, p = 3, and e mm = -98. 
The numbers x = 6.87 X 10 ” 97 and y = 
6.81 X 10 97 appear to be perfectly ordi¬ 
nary floating-point numbers, which are 
more than a factor of 10 larger than the 
smallest floating-point number 1.00 x 
10~ 98 . They have a strange property, 
however: x © y = 0 even though x yl 
The reason is that x — y = .06 X 10“ 97 
= 6.0 X 10 99 is too small to be repre¬ 
sented as a normalized number and so 
must be flushed to zero. 

How important is it to preserve the 
property 

x = y**x — y = 0? (10) 

It is very easy to imagine writing the 
code fragment if (x ^ y) then z = 1 / 
(x - y) and later having a program fail 
due to a spurious division by zero. Track¬ 
ing down bugs like this is frustrating 
and time consuming. On a more philo¬ 
sophical level, computer science text¬ 
books often point out that even though it 
is currently impractical to prove large 
programs correct, designing programs 
with the idea of proving them often re¬ 
sults in better code. For example, intro¬ 
ducing invariants is useful, even if they 
are not going to be used as part of a 
proof. Floating-point code is just like any 
other code: It helps to have provable facts 
on which to depend. For example, when 
analyzing formula (7), it will be helpful 
to know that x/2<y<2x=*x © y = x 
- y (see Theorem 11). Similarly, know¬ 
ing that (10) is true makes writing reli¬ 
able floating-point code easier. If it is 
only true for most numbers, it cannot be 
used to prove anything. 

The IEEE standard uses denormal¬ 
ized 12 numbers, which guarantee (10), as 


12 They are called subnormal in 854, denormal in 
754. 
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Figure 2. Flush to zero compared with gradual underflow. 


well as other useful relations. They are 
the most controversial part of the stan¬ 
dard and probably accounted for the long 
delay in getting 754 approved. Most 
high-performance hardware that claims 
to be IEEE compatible does not support 
denormalized numbers directly but 
rather traps when consuming or produc¬ 
ing denormals, and leaves it to software 
to simulate the IEEE standard. 13 The 
idea behind denormalized numbers goes 
back to Goldberg [1967] and is simple. 
When the exponent is e mm , the signifi- 
cand does not have to be normalized. For 
example, when /3 = 10, p = 3, and e min 
= -98, 1.00 X 10“ 98 is no longer the 
smallest floating-point number, because 
0.98 X 10 98 is also a floating-point 
number. 

There is a small snag when 0 = 2 and 
a hidden bit is being used, since a num¬ 
ber with an exponent of e min will always 
have a significand greater than or equal 
to 1.0 because of the implicit leading bit. 
The solution is similar to that used to 
represent 0 and is summarized in Table 
2. The exponent - 1 is used to rep¬ 
resent denormals. More formally, if the 
bits in the significant field are b v 
b 2 , ■ ■ ■, b p _ 1 and the value of the expo¬ 
nent is e, then when e > e min - 1, the 
number being represented is 1 ■b 1 b 2 ••• 
b ■ i x 2 s , whereas when e — e mm — 1, 
the number being represented is 0.b x b 2 
• ■ ■ b p _ 1 x 2 e+1 . The +1 in the exponent 
is needed because denormals have an ex¬ 
ponent of e min , not e mm - 1. 


13 This is the cause of one of the most troublesome 
aspects of the standard. Programs that frequently 
underflow often run noticeably slower on hardware 
that uses software traps. 


Recall the example 0 — 10, p = 3, e mm 
= -98, x = 6.87 X 10- 97 , and y = 6.81 
X 10- 97 presented at the beginning of 
this section. With denormals, x — y does 
not flush to zero but is instead repre¬ 
sented by the denormalized number 
.6 X 10~ 98 . This behavior is called 
gradual underflow. It is easy to verify 
that (10) always holds when using 
gradual underflow. 

Figure 2 illustrates denormalized 
numbers. The top number line in the 
figure shows normalized floating-point 
numbers. Notice the gap between 0 and 
the smallest normalized number 1.0 X 
0 Bmm . If the result of a floating-point cal¬ 
culation falls into this gulf, it is flushed 
to zero. The bottom number line shows 
what happens when denormals are added 
to the set of floating-point numbers. The 
“gulf’ is filled in; when the result of a 
calculation is less than 1.0 X 0 e ™ m , it is 
represented by the nearest denormal. 
When denormalized numbers are added 
to the number line, the spacing between 
adjacent floating-point numbers varies in 
a regular way: Adjacent spacings are ei¬ 
ther the same length or differ by a factor 
of 0. Without denormals, the spacing 
abruptly changes from /3 p+1 /3 emm to 0 Smm , 
which is a factor of 0 p 1 , rather than the 
orderly change by a factor of 0. Because 
of this, many algorithms that can have 
large relative error for normalized num¬ 
bers close to the underflow threshold are 
well behaved in this range when gradual 
underflow is used. 

Without gradual underflow, the simple 
expression x + y can have a very large 
relative error for normalized inputs, as 
was seen above for x = 6.87 X 10~ 9 ' and 
;y = 6.81 X 10 97 . Large relative errors 
can happen even without cancellation, as 
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the following example shows [Demmel 
1984], Consider dividing two complex 
numbers, a + ib and c + id. The obvious 
formula 

a + ib ac + bd be — ad 

c + id c 2 + d 2 c 2 + d 2 

suffers from the problem that if either 
component of the denominator c + id is 
larger than y/g/3 e nrn*/ 2 , the formula will 
overflow even though the final result may 
be well within range. A better method of 
computing the quotients is to use Smith’s 
formula: 

a + b(d / c) b — a(d/c ) 

c+d(d/c) c+d(d/c) 

a + ib if | d \ < | c | 

c + id b + a(c/d ) — a 4- b(c/d) 

d + c(c/d) d + c(c/d) 
if | d | > | c |. 

(11) 


Applying Smith’s formula to 


2 • 10- 98 + ilO” 98 
4 • 10 98 + i(2 • 10“ 98 ) 


gives the correct answer of 0.5 with grad¬ 
ual underflow. It yields 0.4 with flush to 
zero, an error of 100 ulps. It is typical for 
denormalized numbers to guarantee er¬ 
ror bounds for arguments all the way 
down to 1.0 X f3 e,mn . 


2.3 Exceptions, Flags, and Trap Handlers 

When an exceptional condition like divi¬ 
sion by zero or overflow occurs in IEEE 
arithmetic, the default is to deliver a 
result and continue. Typical of the de¬ 
fault results are NaN for 0/0 and v - 1 
and oo for 1/0 and overflow. The preced¬ 
ing sections gave examples where pro¬ 
ceeding from an exception with these 
default values was the reasonable thing 
to do. When any exception occurs, a sta¬ 
tus flag is also set. Implementations of 
the IEEE standard are required to pro¬ 
vide users with a way to read and write 
the status flags. The flags are “sticky” in 


that once set, they remain set until ex¬ 
plicitly cleared. Testing the flags is the 
only way to distinguish 1/0, which is a 
genuine infinity from an overflow. 

Sometimes continuing execution in the 
face of exception conditions is not appro¬ 
priate. Section 2.2.2 gave the example of 
x/(x 2 + 1). When x> the 

denominator is infinite, resulting in a 
final answer of 0, which is totally wrong. 
Although for this formula the problem 
can be solved by rewriting it as 1 / 
(x + x 1 ), rewriting may not always 
solve the problem. The IEEE standard 
strongly recommends that implementa¬ 
tions allow trap handlers to be installed. 
Then when an exception occurs, the trap 
handler is called instead of setting the 
flag. The value returned by the trap 
handler will be used as the result of 
the operation. It is the responsibility 
of the trap handler to either clear or set 
the status flag; otherwise, the value of 
the flag is allowed to be undefined. 

The IEEE standard divides exceptions 
into five classes: overflow, underflow, di¬ 
vision by zero, invalid operation, and in¬ 
exact. There is a separate status flag for 
each class of exception. The meaning of 
the first three exceptions is self-evident. 
Invalid operation covers the situations 
listed in Table 3. The default result of an 
operation that causes an invalid excep¬ 
tion is to return an NaN, but the con¬ 
verse is not true. When one of the 
operands to an operation is an NaN, the 
result is an NaN, but an invalid excep¬ 
tion is not raised unless the operation 
also satisfies one of the conditions in 
Table 3. 

The inexact exception is raised when 
the result of a floating-point operation is 
not exact. In the (3 = 10, p = 3 system, 
3.5 <8> 4.2 = 14.7 is exact, but 3.5 <8> 4.3 
= 15.0 is not exact (since 3.5 ■ 4.3 = 
15.05) and raises an inexact exception. 
Section 4.2 discusses an algorithm that 
uses the inexact exception. A summary 
of the behavior of all five exceptions is 
given in Table 4. 

There is an implementation issue con¬ 
nected with the fact that the inexact ex¬ 
ception is raised so often. If floating-point 
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Table 4. Exceptions in IEEE 754 a 


Exception 

Result When Traps Disabled 

Argument to Trap Handler 

Overflow 

±oo or ±x max 

Round! x2 _ “) 

Underflow 

0, ± 2 Bmm or denormal 

Round! x2“) 

Divide by zero 

± oo 

Operands 

Invalid 

NaN 

Operands 

Inexact 

round! x) 

round! x) 


Is the exact result of the operation, a = 192 for single precision, 1536 for 
double, and x max = 1.11 • • ■ 11 x 2 3 ™u 


hardware does not have flags of its own 
but instead interrupts the operating sys¬ 
tem to signal a floating-point exception, 
the cost of inexact exceptions could be 
prohibitive. This cost can be avoided by 
having the status flags maintained by 
software. The first time an exception is 
raised, set the software flag for the ap¬ 
propriate class and tell the floating-point 
hardware to mask off that class of excep¬ 
tions. Then all further exceptions will 
run without interrupting the operating 
system. When a user resets that status 
flag, the hardware mask is reenabled. 

2.3.1 Trap Handlers 

One obvious use for trap handlers is for 
backward compatibility. Old codes that 
expect to be aborted when exceptions oc¬ 
cur can install a trap handler that aborts 
the process. This is especially useful for 
codes with a loop like do S until (x > = 
100). Since comparing a NaN to a num¬ 
ber with <,<,>,>, or = (but not 
always returns false, this code will go 
into an infinite loop if x ever becomes 
an NaN. 

There is a more interesting use for 
trap handlers that comes up when com¬ 
puting products such as IT" = x x l that could 
potentially overflow. One solution is to 
use logarithms and compute expQNog xj 
instead. The problems with this ap¬ 
proach are that it is less accurate and 
costs more than the simple expression 
Ux„ even if there is no overflow. There is 
another solution using trap handlers 
called over / underflow counting that 
avoids both of these problems [Sterbenz 
1974], 

The idea is as follows: There is a global 
counter initialized to zero. Whenever the 


partial product p k = I\ k l=l x l overflows for 
some k, the trap handler increments the 
counter by 1 and returns the overflowed 
quantity with the exponent wrapped 
around. In IEEE 754 single precision, 
e max = 127, so if p k = 1.45 X 2 130 , it will 
overflow and cause the trap handler to be 
called, which will wrap the exponent back 
into range, changing p k to 1.45 X 2~ 62 
(see below). Similarly, if p k underflows, 
the counter would be decremented and 
the negative exponent would get wrapped 
around into a positive one. When all the 
multiplications are done, if the counter is 
zero, the final product is p n . If the 
counter is positive, the product is over¬ 
flowed; if the counter is negative, it un¬ 
derflowed. If none of the partial products 
is out of range, the trap handler is never 
called and the computation incurs no ex¬ 
tra cost. Even if there are over/under¬ 
flows, the calculation is more accurate 
than if it had been computed with loga¬ 
rithms, because each p k was computed 
from p k _ 1 using a full-precision multi¬ 
ply. Barnett [1987] discusses a formula 
where the full accuracy of over/under¬ 
flow counting turned up an error in ear¬ 
lier tables of that formula. 

IEEE 754 specifies that when an over¬ 
flow or underflow trap handler is called, 
it is passed the wrapped-around result as 
an argument. The definition of wrapped 
around for overflow is that the result is 
computed as if to infinite precision, then 
divided by 2“, then rounded to the rele¬ 
vant precision. For underflow, the result 
is multiplied by 2“. The exponent a is 
192 for single precision and 1536 for dou¬ 
ble precision. This is why 1.45 x 2 130 was 
transformed into 1.45 X 2 62 in the ex¬ 
ample above. 
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2.3.2 Rounding Modes 

In the IEEE standard, rounding occurs 
whenever an operation has a result that 
is not exact, since (with the exception of 
binary decimal conversion) each opera¬ 
tion is computed exactly then rounded. 
By default, rounding means round to¬ 
ward nearest. The standard requires that 
three other rounding modes be provided; 
namely, round toward 0, round toward 
+ oo, and round toward — oo. When used 
with the convert to integer operation, 
round toward — oo causes the convert to 
become the floor function, whereas, round 
toward + oo is ceiling. The rounding mode 
affects overflow because when round to¬ 
ward 0 or round toward - oo is in effect, 
an overflow of positive magnitude causes 
the default result to be the largest repre¬ 
sentable number, not + oo . Similarly, 
overflows of negative magnitude will 
produce the largest negative number 
when round toward + oo or round toward 
0 is in effect. 

One application of rounding modes oc¬ 
curs in interval arithmetic (another is 
mentioned in Section 4.2). When using 
interval arithmetic, the sum of two num¬ 
bers x and y is an interval [ z, z], where 
z is x © y rounded toward - oo and z is 
x © y rounded toward +oo. The exact 
result of the addition is contained within 
the interval [z, z]. Without rounding 
modes, interval arithmetic is usually im¬ 
plemented by computing z = (x © y) 
(1 - e) and z = (x © y)(l + e), where e 
is machine epsilon. This results in over¬ 
estimates for the size of the intervals. 
Since the result of an operation in inter¬ 
val arithmetic is an interval, in general 
the input to an operation will also be an 
interval. If two intervals [x, 3c] and [y, y] 
are added, the result is [ z, z], wherehz is 
x © y with the rounding mode set to 
round toward — oo , and z is x © z with 
the rounding mode set toward + oo. 

When a floating-point calculation is 
performed using interval arithmetic, the 
final answer is an interval that contains 
the exact result of the calculation. This 
is not very helpful if the interval turns 
out to be large (as it often does), since the 


correct answer could be anywhere in that 
interval. Interval arithmetic makes more 
sense when used in conjunction with a 
multiple precision floating-point pack¬ 
age. The calculation is first performed 
with some precision p. If interval arith¬ 
metic suggests that the final answer may 
be inaccurate, the computation is redone 
with higher and higher precisions until 
the final interval is a reasonable size. 

2.3.3 Flags 

The IEEE standard has a number of flags 
and modes. As discussed above, there is 
one status flag for each of the five excep¬ 
tions: underflow, overflow, division by 
zero, invalid operation, and inexact. 
There are four rounding modes: round 
toward nearest, round toward + oo, round 
toward 0, and round toward — °°. It is 
strongly recommended that there be an 
enable mode bit for each of the five ex¬ 
ceptions. This section gives some exam¬ 
ples of how these modes and flags can be 
put to good use. A more sophisticated 
example is discussed in Section 4.2. 

Consider writing a subroutine to com¬ 
pute x n , where n is an integer. When 
n > 0, a simple routine like 

PositivePower(x,n) { 

while (n is even) { 
x = x*x 
n = n/2 

} 

U = X 

while (true) { 
n = n/2 

if (n = = 0) return u 

X = X * X 

if (n is odd) u = u * x 

} 

} 

will compute x n . 

If n < 0, the most accurate way to 
compute x" is not to call Positive- 
Power(l/x, -n) but rather 1/Posi- 
tivePower(x, -n), because the first 
expression multiplies n quantities, each 
of which has a rounding error from the 
division (i.e., 1/x). In the second expres¬ 
sion these are exact (i.e., x) and the final 
division commits just one additional 
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rounding error. Unfortunately, there is a 
slight snag in this strategy. If Positive- 
Powerfx, -n) underflows, then either 
the underflow trap handler will be called 
or the underflow status flag will be set. 
This is incorrect, because if x~ n under¬ 
flows, then x n will either overflow or be 
in range. 14 But since the IEEE standard 
gives the user access to all the flags, the 
subroutine can easily correct for this. 
It turns off the overflow and underflow 
trap enable bits and saves the overflow 
and underflow status bits. It then com¬ 
putes l/PositivePower(x, -n). If nei¬ 
ther the overflow nor underflow status 
bit is set, it restores them together with 
the trap enable bits. If one of the status 
bits is set, it restores the flags and redoes 
the calculation using PositivePower 
(1/x, — n), which causes the correct ex¬ 
ceptions to occur. 

Another example of the use of flags 
occurs when computing arccos via the 
formula 


1 - x 

arccos x = 2 arctan t /- . 

V 1 + v 

If arctan(oo) evaluates to ir /2, then are- 
cos(—1) will correctly evaluate to 
2 arctan(oo) = x because of infinity arith¬ 
metic. There is a small snag, however, 
because the computation of (1 - x)/ 
(1 + x) will cause the divide by zero ex¬ 
ception flag to be set, even though arc- 
cos(-l) is not exceptional. The solution 
to this problem is straightforward. Sim¬ 
ply save the value of the divide by zero 
flag before computing arccos, then re¬ 
store its old value after the computation. 

3. SYSTEMS ASPECTS 

The design of almost every aspect of a 
computer system requires knowledge 
about floating point. Computer architec- 


14 It can be in range because if x < 1, n < 0, and 

x~ n is just a tiny bit smaller than the underflow 

threshold 2 e “'», then x n ~ 2~ e ”> m < 2 e ">“‘ and so 

may not overflow, since in all IEEE precisions, 

— P <1 P 

mm ^ ''max • 


tures usually have floating-point instruc¬ 
tions, compilers must generate those 
floating-point instructions, and the oper¬ 
ating system must decide what to do 
when exception conditions are raised for 
those floating-point instructions. Com¬ 
puter system designers rarely get guid¬ 
ance from numerical analysis texts, 
which are typically aimed at users and 
writers of software not at computer 
designers. 

As an example of how plausible design 
decisions can lead to unexpected behav¬ 
ior, consider the following BASIC 
program: 

q = 3.0/7.0 

if q = 3.0/7.0 then print “Equal”: 

else print “Not Equal” 

When compiled and run using Borland’s 
Turbo Basic 15 on an IBM PC, the pro¬ 
gram prints Not Equal! This example 
will be analyzed in Section 3.2.1. 

Incidentally, some people think that 
the solution to such anomalies is never to 
compare floating-point numbers for 
equality but instead to consider them 
equal if they are within some error bound 
E. This is hardly a cure all, because it 
raises as many questions as it answers. 
What should the value of E be? If x < 0 
and y > 0 are within E, should they re¬ 
ally be considered equal, even though 
they have different signs? Furthermore, 
the relation defined by this rule, a ~ b o 
j a - b | < E, is not an equivalence rela¬ 
tion because a ~ b and b ~ c do not 
imply that a ~ c. 

3.1 Instruction Sets 

It is common for an algorithm to require 
a short buret of higher precision in order 
to produce accurate results. One example 
occu rs in the quadratic formula [ — b 
± Vb 2 — 4ac /2 a. As discussed in Sec¬ 
tion 4.1, when b 2 = 4ac, rounding error 
can contaminate up to half the digits in 
the roots computed with the quadratic 


lo Turbo Basic is a registered trademark of Borland 
International, Inc. 
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formula. By performing the subcalcula¬ 
tion of b ' 2 - A ac in double precision, half 
the double precision bits of the root are 
lost, which means that all the single pre¬ 
cision bits are preserved. 

The computation of 6 2 - 4 ac in double 
precision when each of the quantities a, 
b, and c are in single precision is easy if 
there is a multiplication instruction that 
takes two single precision numbers and 
produces a double precision result. To 
produce the exactly rounded product of 
two p-digit numbers, a multiplier needs 
to generate the entire 2 p bits of product, 
although it may throw bits away as it 
proceeds. Thus, hardware to compute a 
double-precision product from single-pre¬ 
cision operands will normally be only a 
little more expensive than a single-preci¬ 
sion multiplier and much less expensive 
than a double-precision multiplier. De¬ 
spite this, modern instruction sets tend 
to provide only instructions that produce 
a result of the same precision as the 
operands. 16 

If an instruction that combines two 
single-precision operands to produce a 
double-precision product were only useful 
for the quadratic formula, it would not be 
worth adding to an instruction set. This 
instruction has many other uses, how¬ 
ever. Consider the problem of solving a 
system of linear equations: 

au*i + a 12 x 2 + + a ln x n = b x 

<*21*1 + «22*2 + ••• +a 2 n x n = b 2 

«»i*i + a n2 x 2 + ■■■ +a nn x n = b n , 
which can be written in matrix form as 


b, where 




/ «u 

a l2 " 

‘ a ln 

A = 

a 21 

a 22 

°2 n 


\ a nl 

a n2 

■■ a nnj 


16 This is probably because designers like “orthogo¬ 
nal” instruction sets, where the precisions of a 
floating-point instruction are independent of the 
actual operation. Making a special case for multi¬ 
plication destroys this orthogonality. 


Suppose a solution x (1) is computed by 
some method, perhaps Gaussian elimina¬ 
tion. There is a simple way to improve 
the accuracy of the result called iterative 
improvement. First compute 

£ = Ax w - b. (12) 

Then solve the system 

Ay = £. (13) 

Note that if x (1) is an exact solution, 
then £ is the zero vector, as is y. 

In general, the computation of £ and y 
will incur rounding error, so Ay * £ = 
Ax (1) — b = A(x (1) — x), where x is the 
(unknown) true solution. Then y ~ 

x (1) - x, so an improved estimate for the 
solution is 

x <2) = x m — y. (14) 

The three steps (12), (13), and (14) can be 
repeated, replacing x (1) with x (2) , and 
x (2) with x i3> . This argument that x t! + 1) 
is more accurate than x (0 is only infor¬ 
mal. For more information, see Golub 
and Van Loan [1989]. 

When performing iterative improve¬ 
ment, £ is a vector whose elements are 
the difference of nearby inexact floating¬ 
point numbers and so can suffer from 
catastrophic cancellation. Thus, iterative 
improvement is not very useful unless 
£ = Ax (1) - b is computed in double pre¬ 
cision. Once again, this is a case of com¬ 
puting the product of two single-precision 
numbers (A and x (1) ), where the full 
double-precision result is needed. 

To summarize, instructions that multi¬ 
ply two floating-point numbers and re¬ 
turn a product with twice the precision of 
the operands make a useful addition to a 
floating-point instruction set. Some of the 
implications of this for compilers are dis¬ 
cussed in the next section. 

3.2 Languages and Compilers 

The interaction of compilers and floating 
point is discussed in Farnum [1988], and 
much of the discussion in this section is 
taken from that paper. 
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3.2.1 Ambiguity 

Ideally, a language definition should de¬ 
fine the semantics of the language pre¬ 
cisely enough to prove statements about 
programs. Whereas this is usually true 
for the integer part of a language, lan¬ 
guage definitions often have a large gray 
area when it comes to floating point 
(modula-3 is an exception [Nelson 1991]). 
Perhaps this is due to the fact that many 
language designers believe that nothing 
can be proven about floating point, since 
it entails rounding error. If so, the previ¬ 
ous sections have demonstrated the fal¬ 
lacy in this reasoning. This section 
discusses some common gray areas in 
language definitions and gives sugges¬ 
tions about how to deal with them. 

Remarkably enough, some languages 
do not clearly specify that if x is a float¬ 
ing-point variable (with say a value of 
3.0/10.0), then every occurrence of (say) 
10.0 *x must have the same value. For 
example Ada, 17 which is based on 
Brown’s model, seems to imply that 
floating-point arithmetic only has to sat¬ 
isfy Brown’s axioms, and thus expres¬ 
sions can have one of many possible 
values. Thinking about floating point in 
this fuzzy way stands in sharp contrast 
to the IEEE model, where the result of 
each floating-point operation is precisely 
defined. In the IEEE model, we can prove 
that (3.0/10.0) * 3.0 evaluates to 3 (Theo¬ 
rem 7). In Brown’s model, we cannot. 

Another ambiguity in most language 
definitions concerns what happens on 
overflow, underflow, and other excep¬ 
tions. The IEEE standard precisely spec¬ 
ifies the behavior of exceptions, so 
languages that use the standard as a 
model can avoid any ambiguity on this 
point. 

Another gray area concerns the inter¬ 
pretation of parentheses. Due to roundoff 
errors, the associative laws of algebra do 
not necessarily hold for floating-point 


1 Ada is a registered trademark of the U S. Govern¬ 
ment Ada joint program office 


numbers. For example, the expression 
(x + y) + z has a totally different answer 
than x + (y + z) when x = 10 30 , 
y — — 10 30 , and z = 1 (it is 1 in the for¬ 
mer case, 0 in the latter). The impor¬ 
tance of preserving parentheses cannot 
be overemphasized. The algorithms pre¬ 
sented in Theorems 3, 4, and 6 all depend 
on it. For example, in Theorem 6, the 
formula x h = mx — (mx — x) would re¬ 
duce to x h = x if it were not for paren¬ 
theses, thereby destroying the entire 
algorithm. A language definition that 
does not require parentheses to be 
honored is useless for floating-point 
calculations. 

Subexpression evaluation is impre¬ 
cisely defined in many languages. Sup¬ 
pose ds is double precision, but x and y 
are single precision. Then in the expres¬ 
sion ds + x * y, is the product performed 
in single or double precision? Here is 
another example: In x-fm/n where m 
and n are integers, is the division an 
integer operation or a floating-point one? 
There are two ways to deal with this 
problem, neither of which is completely 
satisfactory. The first is to require that 
all variables in an expression have the 
same type. This is the simplest solution 
but has some drawbacks. First, lan¬ 
guages like Pascal that have subrange 
types allow mixing subrange variables 
with integer variables, so it is somewhat 
bizarre to prohibit mixing single- and 
double-precision variables. Another prob¬ 
lem concerns constants. In the expres¬ 
sion 0.1 *x, most languages interpret 0.1 
to be a single-precision constant. Now 
suppose the programmer decides to 
change the declaration of all the 
floating-point variables from single to 
double precision. If 0.1 is still treated as 
a single-precision constant, there will be 
a compile time error. The programmer 
will have to hunt down and change every 
floating-point constant. 

The second approach is to allow mixed 
expressions, in which case rules for 
subexpression evaluation must be pro¬ 
vided. There are a number of guiding 
examples. The original definition of C 
required that every floating-point expres- 
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sion be computed in double precision 
[Kernighan and Ritchie 1978]. This leads 
to anomalies like the example immedi¬ 
ately proceeding Section 3.1. The expres¬ 
sion 3.0/7.0 is computed in double 
precision, but if q is a single-precision 
variable, the quotient is rounded to sin¬ 
gle precision for storage. Since 3/7 is a 
repeating binary fraction, its computed 
value in double precision is different from 
its stored value in single precision. Thus, 
the comparison q = 3/7 fails. This sug¬ 
gests that computing every expression in 
the highest precision available is not a 
good rule. 

Another guiding example is inner 
products. If the inner product has thou¬ 
sands of terms, the rounding error in the 
sum can become substantial. One way to 
reduce this rounding error is to accumu¬ 
late the sums in double precision (this 
will be discussed in more detail in Sec¬ 
tion 3.2.3). If d is a double-precision 
variable, and x[] and y[] are single preci¬ 
sion arrays, the inner product loop will 
look like d = d +x[i] *y[i]. If the multi¬ 
plication is done in single precision, much 
of the advantage of double-precision ac¬ 
cumulation is lost because the product is 
truncated to single precision just before 
being added to a double-precision 
variable. 

A rule that covers the previous two 
examples is to compute an expression in 
the highest precision of any variable that 
occurs in that expression. Then q = 
3.0/7.0 will be computed entirely in sin¬ 
gle precision 18 and will have the Boolean 
value true, whereas d = d + x[i] * y[i] 
will be computed in double precision, 
gaining the full advantage of double-pre¬ 
cision accumulation. This rule is too sim¬ 
plistic, however, to cover all cases 
cleanly. If dx and dy are double-preci¬ 
sion variables, the expression y = x + 
single(dx — dy) contains a double-preci¬ 
sion variable, but performing the sum in 
double precision would be pointless be- 


18 This assumes the common convention that 3.0 is 
a single-precision constant, whereas 3.0D0 is a dou¬ 
ble-precision constant. 


cause both operands are single precision, 
as is the result. 

A more sophisticated subexpression 
evaluation rule is as follows. First, as¬ 
sign each operation a tentative precision, 
which is the maximum of the precision of 
its operands. This assignment has to be 
carried out from the leaves to the root of 
the expression tree. Then, perform a sec¬ 
ond pass from the root to the leaves. In 
this pass, assign to each operation the 
maximum of the tentative precision and 
the precision expected by the parent. In 
the case of q = 3.0/7.0, every leaf is sin¬ 
gle precision, so all the operations are 
done in single precision. In the case of 
d = d + x[i] * y[i], the tentative precision 
of the multiply operation is single preci¬ 
sion, but in the second pass it gets pro¬ 
moted to double precision because its 
parent operation expects a double-preci¬ 
sion operand. And in y = x + single 
(dx - dy), the addition is done in single 
precision. Farnum [1988] presents evi¬ 
dence that this algorithm is not difficult 
to implement. 

The disadvantage of this rule is that 
the evaluation of a subexpression de¬ 
pends on the expression in which it is 
embedded. This can have some annoying 
consequences. For example, suppose you 
are debugging a program and want to 
know the value of a subexpression. You 
cannot simply type the subexpression to 
the debugger and ask it to be evaluated 
because the value of the subexpression in 
the program depends on the expression 
in which it is embedded. A final com¬ 
ment on subexpression is that since con¬ 
verting decimal constants to binary is an 
operation, the evaluation rule also af¬ 
fects the interpretation of decimal con¬ 
stants. This is especially important for 
constants like 0.1, which are not exactly 
representable in binary. 

Another potential gray area occurs 
when a language includes exponentia¬ 
tion as one of its built-in operations. Un¬ 
like the basic arithmetic operations, the 
value of exponentiation is not always ob¬ 
vious [Kahan and Coonen 1982], If * * 
is the exponentiation operator, then 
(-3)* *3 certainly has the value -27. 
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However, (— 3.0) * * 3.0 is problematical. 
If the * * operator checks for integer pow¬ 
ers, it would compute (- 3.0) * * 3.0 as 
— 3.0 3 = —27. On the other hand, if the 
formula x y = e- vloe x is used to define * * 
for real arguments, then depending on 
the log function, the result could be a 
NaN (using the natural definition of 
log(x) = NaN when x < 0). If the FOR¬ 
TRAN CLOG function is used, however, 
the answer will be — 27 because the ANSI 
FORTRAN standard defines CLOG 
(-3.0) to be iv log 3 [ANSI 1978], The 
programming language Ada avoids this 
problem by only defining exponentia¬ 
tion for integer powers, while ANSI 
FORTRAN prohibits raising a negative 
number to a real power. 

In fact, the FORTRAN standard says 
that 

Any arithmetic operation whose result is not 

mathematically defined is prohibited. .. 

Unfortunately, with the introduction 
of ± oo by the IEEE standard, the mean¬ 
ing of not mathematically defined is no 
longer totally clear cut. One definition 
might be to use the method of Section 
2.2.2. For example, to determine the 
value of a b , consider nonconstant ana¬ 
lytic functions f and g with the property 
that f(x) -* a and g(x) -*■ b as x -> 0. If 
f(x) g(x) always approaches the same 
limit, this should be the value of a b . This 
definition would set 2 00 = oo, which seems 
quite reasonable. In the case of 1.0°°, 
when f(x) = 1 and g(x) = 1/x the limit 
approaches 1, but when f(x) = 1 - x and 
g(x) = 1/x the limit is e. So 1.0 00 should 
be an NaN. In the case of 0°, /'(x) #( * ) = 
e g(x)iog /(•*)_ gi nce f and g are analytical 
and take on the value of 0 at 0, f{x) = 
a 1 x 1 + a 2 x 2 + ■ ■ ■ and g(x) = b 1 x 1 4- 
b 2 x 2 + • • • . Thus, 

limg(x)log f( x) 

= lim xlog( x(a 1 + a 2 x + • • • )) 

= limxlog(a 1 x) = 0. 

x—*-0 

So f{x) 8(x) -*■ e° = 1 for all f and g, 


which means 0° = l. 19 Using this defini¬ 
tion would unambiguously define the ex¬ 
ponential function for all arguments and 
in particular would define (- 3.0) * * 3.0 
to be - 27. 

3.2.2 IEEE Standard 

Section 2 discussed many of the features 
of the IEEE standard. The IEEE stan¬ 
dard, however, says nothing about how 
these features are to be accessed from a 
programming language. Thus, there is 
usually a mismatch between floating¬ 
point hardware that supports the stan¬ 
dard and programming languages like C, 
Pascal, or FORTRAN. Some of the IEEE 
capabilities can be accessed through a 
library of subroutine calls. For example, 
the IEEE standard requires that square 
root be exactly rounded, and the square 
root function is often implemented di¬ 
rectly in hardware. This functionality is 
easily accessed via a library square root 
routine. Other aspects of the standard, 
however, are not so easily implemented 
as subroutines. For example, most com¬ 
puter languages specify at most two 
floating-point types, whereas, the IEEE 
standard has four different precisions (al¬ 
though the recommended configurations 
are single plus single extended or single, 
double, and double extended). Infinity 
provides another example. Constants to 
represent ±oo could be supplied by a 
subroutine. But that might make them 
unusable in places that require constant 
expressions, such as the initializer of a 
constant variable. 

A more subtle situation is manipulat¬ 
ing the state associated with a computa¬ 
tion, where the state consists of the 
rounding modes, trap enable bits, trap 
handlers, and exception flags. One ap¬ 
proach is to provide subroutines for read¬ 
ing and writing the state. In addition, a 


19 The conclusion that 0° = 1 depends on the re¬ 
striction f be nonconstant. If this restriction is 
removed, then letting f be the identically 0 func¬ 
tion gives 0 as a possible value for lim 
and so 0° would have to be defined to be a NaN. 
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single call that can atomically set a new 
value and return the old value is often 
useful. As the examples in Section 2.3.3 
showed, a common pattern of modifying 
IEEE state is to change it only within 
the scope of a block or subroutine. Thus, 
the burden is on the programmer to find 
each exit from the block and make sure 
the state is restored. Language support 
for setting the state precisely in the scope 
of a block would be very useful here. 
Modula-3 is one language that imple¬ 
ments this idea for trap handlers 
[Nelson 1991]. 

A number of minor points need to be 
considered when implementing the IEEE 
standard in a language. Since x - x = 
+ 0 for all x, 20 ( + 0) - ( + 0) = +0. How¬ 
ever, -( + 0) = -0, thus -x should not 
be defined as 0 — x. The introduction of 
NaNs can be confusing because an NaN 
is never equal to any other number (in¬ 
cluding another NaN), so x = x is no 
longer always true. In fact, the expres¬ 
sion x ± x is the simplest way to test for 
a NaN if the IEEE recommended func¬ 
tion Isnan is not provided. Furthermore, 
NaNs are unordered with respect to all 
other numbers, so x < y cannot be de¬ 
fined as not x > y. Since the intro¬ 
duction of NaNs causes floating-point 
numbers to become partially ordered, a 
compare function that returns one of 
< , = , > , or unordered can make it 
easier for the programmer to deal with 
comparisons. 

Although the IEEE standard defines 
the basic floating-point operations to re¬ 
turn a NaN if any operand is a NaN, this 
might not always be the best definition 
for compound operations. For example, 
when computing the appropriate scale 
factor to use in plotting a graph, the 
maximum of a set of values must be 
computed. In this case, it makes sense 
for the max operation simply to ignore 
NaNs. 

Finally, rounding can be a problem. 
The IEEE standard defines rounding pre- 


20 Unless the rounding mode is round toward — <x>, 
in which case x — x = - 0. 


cisely, and it depends on the current 
value of the rounding modes. This some¬ 
times conflicts with the definition of im¬ 
plicit rounding in type conversions or the 
explicit round function in languages. 
This means that programs that wish 
to use IEEE rounding cannot use the 
natural language primitives, and con¬ 
versely the language primitives will be 
inefficient to implement on the ever- 
increasing number of IEEE machines. 

3.2.3 Optimizers 

Compiler texts tend to ignore the subject 
of floating point. For example, Aho et al. 
[1986] mentions replacing x/2.0 with 
x * 0.5, leading the reader to assume that 
x/10.0 should be replaced by 0.1 *x. 
These two expressions do not, however, 
have the same semantics on a binary 
machine because 0.1 cannot be repre¬ 
sented exactly in binary. This textbook 
also suggests replacing x*y-x*z by 
x * (y - z), even though we have seen that 
these two expressions can have quite dif¬ 
ferent values when y ~ z. Although it 
does qualify the statement that any alge¬ 
braic identity can be used when optimiz¬ 
ing code by noting that optimizers should 
not violate the language definition, it 
leaves the impression that floating-point 
semantics are not very important. 
Whether or not the language standard 
specifies that parenthesis must be hon¬ 
ored, (xfy)-f z can have a totally differ¬ 
ent answer than x + (y -f z), as discussed 
above. 

There is a problem closely related to 
preserving parentheses that is illus¬ 
trated by the following code: 

eps = 1 

do eps = 0.5 * eps while (eps + 1 > 1) 

This code is designed to give an estimate 
for machine epsilon. If an optimizing 
compiler notices that eps + 1 > 1 eps 
> 0, the program will be changed com¬ 
pletely. Instead of computing the small¬ 
est number x such that 1 © x is still 
greater than x(x « e « (3~ p ), it will com¬ 
pute the largest number x for which x /2 
is rounded to 0 (x = /3 emm ). Avoiding this 
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kind of “optimization” is so important 
that it is worth presenting one more use¬ 
ful algorithm that is totally ruined by it. 

Many problems, such as numerical in¬ 
tegration and the numerical solution of 
differential equations, involve computing 
sums with many terms. Because each 
addition can potentially introduce an er¬ 
ror as large as 1/2 ulp, a sum involving 
thousands of terms can have quite a bit 
of rounding error. A simple way to cor¬ 
rect for this is to store the partial sum¬ 
mand in a double-precision variable and 
to perform each addition using double 
precision. If the calculation is being done 
in single precision, performing the sum 
in double precision is easy on most com¬ 
puter systems. If the calculation is al¬ 
ready being done in double precision, 
however, doubling the precision is not so 
simple. One method that is sometimes 
advocated is to sort the numbers and add 
them from smallest to largest. There is a 
much more efficient method, however, 
that dramatically improves the accuracy 
of sums, namely Theorem 8. 

Theorem 8 (Kahan Summation Formula) 

Suppose i x is computed using the 
following algorithm 

S = X[l] 

C = 0 

for j = 2 to N { 

Y = Xffl - C 

T = S+Y 

C = (T - S) - Y 

S = T 

} 

Then the computed sum S is equal to 
Yxy(T + 5 7 ) + 0(Ne 2 )J2 | Xj\, where \ bj \ 
< 2e. 

Using the naive formula Ex ; , the com¬ 
puted sum is equal to E^(l + 5 ) where 
| bj | < (n — j)e. Comparing this with the 
error in the Kahan summation form¬ 
ula shows a dramatic improvement. 
Each summand is perturbed by only 2e 
instead of perturbations as large as ne 
in the simple formula. Details are in 
Section 4.3. 


An optimizer that believed floating¬ 
point arithmetic obeyed the laws of alge¬ 
bra would conclude that C = [T - S] - 
Y = [(S + Y) — S] - Y = 0, rendering 
the algorithm completely useless. These 
examples can be summarized by saying 
that optimizers should be extremely cau¬ 
tious when applying algebraic identities 
that hold for the mathematical real num¬ 
bers to expressions involving floating¬ 
point variables. 

Another way that optimizers can 
change the semantics of floating-point 
code involves constants. In the expres¬ 
sion 1.0E-40 * x, there is an implicit dec¬ 
imal to binary conversion operation that 
converts the decimal number to a binary 
constant. Because this constant cannot 
be represented exactly in binary, the in¬ 
exact exception should be raised. In addi¬ 
tion, the underflow flag should to be set 
if the expression is evaluated in single 
precision. Since the constant is inexact, 
its exact conversion to binary depends on 
the current value of the IEEE rounding 
modes. Thus, an optimizer that converts 
1.0E-40 to binary at compile time would 
be changing the semantics of the pro¬ 
gram. Constants like 27.5, however, that 
are exactly representable in the smallest 
available precision can be safely con¬ 
verted at compile time, since they are 
always exact, cannot raise any exception, 
and are unaffected by the rounding 
modes. Constants that are intended to be 
converted at compile time should be done 
with a constant declaration such as const 
pi = 3.14159265. 

Common subexpression elimination is 
another example of an optimization that 
can change floating-point semantics, as 
illustrated by the following code: 

C = A*B; 

RndMode = Up 
D = A * B; 

Although A * B may appear to be a com¬ 
mon subexpression, it is not because the 
rounding mode is different at the two 
evaluation sites. Three final examples 
are x = x cannot be replaced by the 
Boolean constant true, because it fails 
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when x is an NaN; —x - 0 - x fails for 
x = + 0; and x < y is not the opposite of 
x > y, because NaNs are neither greater 
than nor less than ordinary floating-point 
numbers. 

Despite these examples, there are use¬ 
ful optimizations that can be done on 
floating-point code. First, there are alge¬ 
braic identities that are valid for float¬ 
ing-point numbers. Some examples in 
IEEE arithmetic are x + y = y + x, 2 X 
x = x + x, 1 X x = x, and 0.5 X x = x/2. 
Even these simple identities, however, 
can fail on a few machines such as CDC 
and Cray supercomputers. Instruction 
scheduling and inline procedure substi¬ 
tution are two other potentially useful 
optimizations. 21 As a final example, con¬ 
sider the expression dx = x * y, where x 
and y are single precision variables and 
dx is double precision. On machines that 
have an instruction that multiplies two 
single-precision numbers to produce a 
double-precision number, dx = x * y can 
get mapped to that instruction rather 
than compiled to a series of instructions 
that convert the operands to double then 
perform a double-to-double precision 
multiply. 

Some compiler writers view restric¬ 
tions that prohibit converting (x + y) A z 
to x + (y + z) as irrelevant, of interest 
only to programmers who use unportable 
tricks. Perhaps they have in mind that 
floating-point numbers model real num¬ 
bers and should obey the same laws real 
numbers do. The problem with real num¬ 
ber semantics is that they are extremely 
expensive to implement. Every time two 
n bit numbers are multiplied, the prod¬ 
uct will have 2 n bits. Every time two n 
bit numbers with widely spaced expo¬ 
nents are added, the sum will have 2 n 
bits. An algorithm that involves thou¬ 
sands of operations (such as solving a 
linear system) will soon be operating on 
huge numbers and be hopelessly slow. 


21 The VMS math libraries on the VAX use a weak 
form of inline procedure substitution in that they 
use the inexpensive jump to subroutine call rather 
than the slower CALLS and CALLG instructions. 


The implementation of library functions 
such as sin and cos is even more difficult, 
because the value of these transcenden¬ 
tal functions are not rational numbers. 
Exact integer arithmetic is often pro¬ 
vided by Lisp systems and is handy for 
some problems. Exact floating-point 
arithmetic is, however, rarely useful. 

The fact is there are useful algorithms 
(like the Kahan summation formula) that 
exploit (x + y) + z =£ x + (y + z), and 
work whenever the bound 

a 0 b = (a + 6)(1 + 5) 

holds (as well as similar bounds for —, 
X, and /). Since these bounds hold for 
almost all commercial hardware not just 
machines with IEEE arithmetic, it would 
be foolish for numerical programmers to 
ignore such algorithms, and it would be 
irresponsible for compiler writers to de¬ 
stroy these algorithms by pretending that 
floating-point variables have real num¬ 
ber semantics. 

3.3 Exception Handling 

The topics discussed up to now have pri¬ 
marily concerned systems implications of 
accuracy and precision. Trap handlers 
also raise some interesting systems is¬ 
sues. The IEEE standard strongly recom¬ 
mends that users be able to specify a trap 
handler for each of the five classes of 
exceptions, and Section 2.3.1 gave some 
applications of user defined trap han¬ 
dlers. In the case of invalid operation 
and division by zero exceptions, the han¬ 
dler should be provided with the 
operands, otherwise with the exactly 
rounded result. Depending on the pro¬ 
gramming language being used, the trap 
handler might be able to access other 
variables in the program as well. For all 
exceptions, the trap handler must be able 
to identify what operation was being 
performed and the precision of its 

destination. 

The IEEE standard assumes that oper¬ 
ations are conceptually serial and that 
when an interrupt occurs, it is possible to 
identify the operation and its operands. 
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On machines that have pipelining or 
multiple arithmetic units, when an ex¬ 
ception occurs, it may not be enough sim¬ 
ply to have the trap handler examine the 
program counter. Hardware support for 
identifying exactly which operation 
trapped may be necessary. 

Another problem is illustrated by the 
following program fragment: 

x = y * z 
z = x * w 
a = b + c 
d = a/x 

Suppose the second multiply raises an 
exception, and the trap handler wants to 
use the value of a. On hardware that can 
do an add and multiply in parallel, 
an optimizer would probably move the 
addition operation ahead of the second 
multiply, so that the add can proceed in 
parallel with the first multiply. Thus, 
when the second multiply traps, a = b + 
c has already been executed, potentially 
changing the result of a. It would not be 
reasonable for a compiler to avoid this 
kind of optimization because every float¬ 
ing-point operation can potentially trap, 
and thus virtually all instruction 
scheduling optimizations would be elimi¬ 
nated. This problem can be avoided by 
prohibiting trap handlers from accessing 
any variables of the program directly. 
Instead, the handler can be given the 
operands or result as an argument. 

But there are still problems. In the 
fragment 

x = y *z 
z = a + b 

the two instructions might well be exe¬ 
cuted in parallel. If the multiply traps, 
its argument z could already have been 
overwritten by the addition, especially 
since addition is usually faster than mul¬ 
tiply. Computer systems that support 
trap handlers in the IEEE standard must 
provide some way to save the value of z, 
either in hardware or by having the 
compiler avoid such a situation in the 
first place. 

Kahan has proposed using presubstitu¬ 
tion instead of trap handlers to avoid 
these problems. In this method, the user 


specifies an exception and a value to be 
used as the result when the exception 
occurs. As an example, suppose that in 
code for computing sin x/x, the user de¬ 
cides that x = 0 is so rare that it would 
improve performance to avoid a test for 
x = 0 and instead handle this case when 
a 0/0 trap occurs. Using IEEE trap han¬ 
dlers, the user would write a handler 
that returns a value of 1 and installs it 
before computing sin x/ x. Using presub¬ 
stitution, the user would specify that 
when an invalid operation occurs, the 
value of 1 should be used. Kahan calls 
this presubstitution because the value to 
be used must be specified before the ex¬ 
ception occurs. When using trap han¬ 
dlers, the value to be returned can be 
computed when the trap occurs. 

The advantage of presubstitution is 
that it has a straightforward hardware 
implementation. As soon as the type of 
exception has been determined, it can be 
used to index a table that contains the 
desired result of the operation. Although 
presubstitution has some attractive at¬ 
tributes, the widespread acceptance of the 
IEEE standard makes it unlikely to be 
widely implemented by hardware manu¬ 
facturers. 

4. DETAILS 

Various claims have been made in this 
paper concerning properties of floating¬ 
point arithmetic. We now proceed to 
show that floating point is not black 
magic, but rather a straightforward 
subject whose claims can be verified 
mathematically. 

This section is divided into three parts. 
The first part represents an introduction 
to error analysis and provides the details 
for Section 1. The second part explores 
binary-to-decimal conversion, filling in 
some gaps from Section 2. The third 
part discusses the Kahan summation 
formula, which was used as an example 
in Section 3. 

4.1 Rounding Error 

In the discussion of rounding error, it 
was stated that a single guard digit is 
enough to guarantee that addition and 
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subtraction will always be accurate (The¬ 
orem 2). We now proceed to verify this 
fact. Theorem 2 has two parts, one for 
subtraction and one for addition. The part 
for subtraction is as follows: 

Theorem 9 

If x and y are positive floating-point num¬ 
bers in a format with parameters fi and p 
and if subtraction is done with p + 1 dig¬ 
its (i.e., one guard digit), then the rela¬ 
tive rounding error in the result is less 
than K/3/2) + l]/3~ p = [1 + (2/(3)]e < 2e. 

Proof. Interchange x and y is neces¬ 
sary so that x > y. It is also harmless to 
scale x and y so that x is represented by 
x 0 .x 1 ■■■ x p _ 1 x /3°. If y is represented 
as y 0 .y 1 • ■ • y p _i, then the difference is 
exact. If y is represented as 0.y 1 • • • y p , 
then the guard digit ensures that the 
computed difference will be the exact dif¬ 
ference rounded to a floating-point num¬ 
ber, so the rounding error is at most e. In 
general,_ let y = 0.0 • • ■ 0y* + 1 • • • y k+p 
and let y be y truncated to p + 1 digits. 
Then, 

y-y 

< (0 - i)(/3- p - 1 ••• +p~ p ~ k ). 

(15) 

From the definition of guard digit, the 
computed value of x - y is x - y 
rounded to be a floating-point number; 
that is, (x — y) + 5, where the rounding 
error 5 satisfies 

I5l<(^)r p . ( 16 ) 

The exact difference is x — y, so the er¬ 
ror is {x - y) - (x - y + b) = y - y + b. 
There are three cases. If x - y > 1, the 
relative error is bounded by 

y - y + <5 

l 

< p-p (i3 - 1 )(r* + ••• +/s-*) + ~ 

z 

</3- p (l + ^). (17) 


Second, if x — y < 1, then 5 = 0. Since 
the smallest that x — y can be is 

k k 



> ( 0 - i )(r 1 + ■■■ + 0 -*) 

(where q = $ - 1), in this case the rela¬ 
tive error is bounded by 

y - y + 5 

(0 - i)(0 _1 + ••• +d“ A ) 

_ (d - l)0- p (0- x + ••• +0-*) 

< (p - l^/T 1 + ••• +p- k ) 

= p- p . (18) 

The final case is when x — y < 1 but 
x - y > 1. The only way this could hap¬ 
pen is if x - y = 1, in which case 5 = 0. 
But if 6 = 0, then (18) applies, so again 
the relative error is bounded by /3~ p < 

r p (i + d/2). ■ 

When /3 = 2, the bound is exactly 
2e, and this bound is achieved for x = 
1 + 2 2 - p and y = 2 1 ~ p - 2 l ~ 2p in the 
limit as p-* oo. When adding numbers of 
the same sign, a guard digit is not neces¬ 
sary to achieve good accuracy, as the 
following result shows. 

Theorem 10 

If x> 0 and y > 0, the relative error in 
computing x + y is at most 2e, even if no 
guard digits are used. 

Proof. The algorithm for addition 
with k guard digits is similar to the 
algorithm for subtraction. If x > y, and 
shift y right until the radix points of x 
and y are aligned. Discard any digits 
shifted past the p + k position. Compute 
the sum of these two p + k digit num¬ 
bers exactly. Then round to p digits. 

We will verify the theorem when no 
guard digits are used; the general case is 
similar. There is no loss of generality in 
assuming that x > y > 0 and that x is 
scaled to be of the form d.d • • • d x (3°. 
First, assume there is no carry out. Then 
the digits shifted off the end of y have a 
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value less than /3 -p+1 and the sum is at 
least 1, so the relative error is less than 
fi~ p+1 /l = 2e. If there is a carry out, the 
error from shifting must be added to the 
rounding error of (l/2)(3~ p+2 . The sum is 
at least /3, so the relative error is less 
than ( p- p+1 + (l/2)d“ p+2 )//3 = (1 + 
p/2)(3- p < 2e. U 

It is obvious that combining these two 
theorems gives Theorem 2. Theorem 2 
gives the relative error for performing 
one operation. Comparing the rounding 
error of x 2 - y 2 and (x + y)(x — y) re¬ 
quires knowing the relative error of mul¬ 
tiple operations. The relative error of 
x © y is = [(x © y) — (x — y) 1/ 
(x - y), which satisfies | | < 2e. Or to 

write it another way, 

x © y = (x - y)(l + 6 t ), |5 1 |<2e. 

(19) 

Similarly, 

x ® y = (x + y)(l + S 2 ), | 6 2 | < 2e. 

(20) 

Assuming that multiplication is per¬ 
formed by computing the exact product 
then rounding, the relative error is at 
most 1/2 ulp, so 

u <8> v = «r(l + 5 S ), | S 3 | < e (21) 

for any floating point numbers u and v. 
Putting these three equations together 
(letting u = x © y and v = x © y) gives 

(x©y)®(x©y) 

= (* - t)(i + 

x(x+y)(l + a 2 )(l + S 3 ). (22) 

So the relative error incurred when com¬ 
puting (x — y)(x + y) is 

(x © y) <8> (x © y) — (x 2 — y 2 ) 

U 2 -y 2 ) 

= (1+«!)(! +fi 2 )(l +fi 8 ) -1. (23) 


This relative error is equal to 5 1 + d 2 + 
8 3 + 5 1 5 2 + 5 1 5 3 + 8 2 8 3 , which is bounded 
by 5e + 8e 2 . In other words, the maxi¬ 
mum relative error is about five round¬ 
ing errors (since e is a small number, e 2 
is almost negligible). 

A similar analysis of (x <8> x) © 
(y<S>y) cannot result in a small value for 
the relative error because when two 
nearby values of x and y are plugged 
into x 2 — y' 2 , the relative error will usu¬ 
ally be quite large. Another way to see 
this is to try and duplicate the analysis 
that worked on (x © y) <8> (x © y), 
yielding 

(x ® x) © (y®y) 

= [ x 2 (l + 5- l ) - y 2 (l + < 5 2 )] (1 + $ 3 ) 
= ((* 2 - y 2 )( 1 + fix) + ( 5 X - <? 2 )y 2 ) 

(1 + 63). 

When x and y are nearby, the error 
term (S 1 - <5 2 )y 2 can be as large as the 
result x 2 - y 2 . These computations for¬ 
mally justify our claim that (x - y) 
(x + y) is more accurate than x 2 - y 2 . 

We next turn to an analysis of the 
formula for the area of a triangle. To 
estimate the maximum error that can 
occur when computing with (7), the fol¬ 
lowing fact will be needed. 

Theorem 11 

If subtraction is performed with a guard 
digit and y/2 < x < 2y, then x — y is 
computed exactly. 

Proof. Note that if x and y have the 
same exponent, then certainly x © y is 
exact. Otherwise, from the condition of 
the theorem, the exponents can differ by 
at most 1. Scale and interchange x and y 
if necessary so 0 < y < x and x is repre¬ 
sented as x 0 .x 1 ■ ■ ■ x p _ 1 and y as O.yj 
• • • y p . Then the algorithm for comput¬ 
ing x © y will compute x — y exactly 
and round to a floating-point number but 
if the difference is of the form 0 .d l 
d p , the difference will already be p digits 
long, and no rounding is necessary. Since 
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x < 2 y, x - y < y, and since y is of the 
form O.c^ • • • d p , so is x - y. ■ 

When (3 > 2, the hypothesis of Theo¬ 
rem 11 cannot be replaced by y / (3 < x < 
13y; the stronger condition y/2 < x < 2y 
is still necessary. The analysis of the 
error in (x — y)( x + y) in the previous 
section used the fact that the relative 
error in the basic operations of addition 
and subtraction is small [namely, eqs. 
(19) and (20)]. This is the most common 
kind of error analysis. Analyzing for¬ 
mula (7), however, requires something 
more; namely, Theorem 11, as the follow¬ 
ing proof will show. 

Theorem 12 

If subtraction uses a guard digit and if a, 
b, and c are the sides of a triangle, the 
relative error in computing (a + (b + 
c))(c - (a - b))(c + (a - b))(a + (b - c)) 
is at most 16 e, provided e < .005. 

Proof. Let us examine the factors one 
by one. From Theorem 10, b ® c = 
{b + c)(l + 5 1 ), where is the relative 
error and | 5 1 1 < 2e. Then the value of 
the first factor is (a 0 (b 0 c)) = (a + 
(b 0 c))(l + 8 2 ) = (a + (b + c)(l + «!)) 
X (1 + <5 2 ), and thus 

(cz + b + c)(l — 2e) 

< [a + (b + c)(l - 2e)](l - 2e) 

< a 0 (6 0 c) 

[ o + (6 + c) (l + 2e)] (1 + 2e) 

< (a + b + c)(l + 2e) 2 . 

This means that there is an ri 1 so that 
(o 0 (b 0 c)) = (a + b 4- c)(l + ?j 1 ) 2 , 

|r?i|<2e. (24) 

The next term involves the potentially 
catastrophic subtraction of c and a © b, 
because a © b may have rounding er¬ 
ror. Because a, b, and c are the sides of a 
triangle, a < b + c, and combining this 
with the ordering c < b < a gives a < b 
+ c<26<2a. So a — b satisfies the 


conditions of Theorem 11. This means 
a - b = a © b is exact, and hence c © 
(a - b) is a harmless subtraction that 
can be estimated from Theorem 9 to be 

(c © (a © 6)) = (c - (a - 6))(l + r/ 2 ), 

I 42 I — 2e. (25) 

The third term is the sum of two exact 
positive quantities, so 

(c © (a © b )) = (c + (a - 6))(1 + z/ 3 ), 
1| < 2e. (26) 

Finally, the last term is 

{a © {b © c)) = (a + (b - c))(l + r, 4 ) 2 , 

I 44 I =£ 26, (27) 

using both Theorem 9 and Theorem 10. 
If multiplication is assumed to be exactly 
rounded so that x ® y = xy( 1 + f) with 
| f | < e, then combining (24), (25), (26), 
and (27) gives 

(a 0 (b © c))(c © (a © 6)) 

(c © (a © 6))(a © (6 © c)) 

< (a + (b + c))(c - (a - 5)) 

(c + (a - 6)) 

(a + (b - c))E, 

where 

E = (1 + 4i) 2 (l + 4 2 )(l + 4 3 )( 1 + 4 4 ) 2 
(1 + fl)(l + f2)(l + f 3 )- 

An upper bound for E is (1 + 2e) 6 (l + 
e) 3 , which expands to 1 + 15e 4- 0(e 2 ). 
Some writers simply ignore the 0(e 2 ) 
term, but it is easy to account for it. 
Writing (1 + 2e) 6 (l + e) 3 = 1 + 15e + 
eR(e), R(e) is a polynomial in e with 
positive coefficients, so it is an increasing 
function of e. Since i?(.005) = .505, i?(e) 
< 1 for all e < .005, and hence E < (1 + 
2e) 6 (l + e) 3 < 1 + 16e. To get a lower 
bound on E, note that 1 — 15e — eR(e) < 
E; so when e < .005, 1 - 16 e < (1 — 
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2e) 6 (l - e) 3 . Combining these two 
bounds yields 1 - 16e < E < 1 + 16c. 
Thus the relative error is at most 16 e. 


Theorem 12 shows there is no catas¬ 
trophic cancellation in formula (7). 
Therefore, although it is not necessary to 
show formula (7) is numerically stable, it 
is satisfying to have a bound for the en¬ 
tire formula, which is what Theorem 3 of 
Section 1.4 gives. 

Proof Theorem 3. Let 

q = (a + (b + c))(c - (a - 6)) 

(c + (a - b))(a + (b — c)) 


Theorem 13 

If /x(x) = ln(l + x)/x, then for 0 < x < 
3/4, 1/2 < n(x) < 1 and the derivative 
satisfies | ju.'( jc) | < 1 /2. 

Proof. Note that n(x) = 1 — x/2 + 
x 2 /3 — • • • is an alternating series with 
decreasing terms, so for x < 1, n(x) > 1 
— x/2 > 1/2. It is even easier to see that 
because the series for g is alternating, 
g(x) < 1. The Taylor series of g\x) is 
also alternating, and if x < 3/4 has de¬ 
creasing terms, so -1/2 < g'(x) < -1/2 
+ 2x/3, or -1/2 s g'(x) < 0, thus 
\Ax)\<i/2. m 

Proof Theorem 4. Since the Taylor se¬ 
ries for In, 


and 

Q — (a © (b © c)) ® (c © (a © b)) 
<8>(c © (a © b)) ®(a © (b © e)). 

Then Theorem 12 shows that Q = q( 1 + 
5), with 5 < 16e. It is easy to check that 

1 - .52 | 5 | < y/1- | 8 | < 

< 1 + .52 | 8 | (28) 

provided 8 < .04/(.52) 2 * .15. Since | 8 | 

< 16c < 16(.005) = .08^5 does satisfy the 
condition. Thus, \/Q = [q(l + 5)] 1 / 2 
= \fq (1 + Si), with | 5 X | < .52 | 8 | < 
8.5e. If square roots are computed to 
within 1/2 ulp, the error when comput¬ 
ing t/Q is (1 + Sj)(l + <5 2 ), with J5 2 | — 
e. If j8 = 2, there is no further error com¬ 
mitted when dividing by 4. Otherwise, 
one more factor 1 + 5 3 with 5 3 | < e is 
necessary for the division, and using the 
method in the proof of Theorem 12, the 
final error bound of (1 + bjXl + 5 2 )(1 + 
<5 3 ) is dominated by 1 + 5 4 , with | 8 4 | < 
lie. ■ 


To make the heuristic explanation im¬ 
mediately following the statement of 
Theorem 4 precise, the next theorem de¬ 
scribes just how closely n(x) approxi¬ 
mates a constant. 


, , * 2 * 3 

ln(l + x) = x --1-- • • • 

2 3 

is an alternating series, 0 < x - ln(l + 
x) < x 2 /2. Therefore, the relative error 
incurred when approximating ln(l + x) 
by x is bounded by x/2. If 1 © x = 1, 
then | x | < e, so the relative error is 
bounded by e/2. 

When 1 © x T 1, define x via 1 © x 
= 1 + x. Then since 0 < x < 1, (1 © x) 
© 1 = x. If division and logarithms are 
computed to within 1/2 ulp, the com¬ 
puted value of the expression ln(l + 
x)/((l + x) - 1) is 


ln(l © x) 

(1 © x) © 1 

ln(l + x) 


(1+50(1+5,) 


(1 + 50(1 + 6 2 ) 


/x(x)(1+50(1 + 5 2 ), (29) 


where | 5j | < e and | S 2 | < e. To esti¬ 
mate g(x), use the mean value theorem, 
which says that 

n(x) - n(x) = (x - x)n'(t) (30) 

for some £ between x and x. From the 
definition of x, it follows that | x - x | < 
e. Combining this with Theorem 13 gives 
I Kx) - g(x) | < e/2 or | g(x)/g(x) - 11 
< e/(2 | /e(x) |) < e, which means g(x) = 
g{x){l + 5 3 ), with 15 3 1 <e. Finally, 
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multiplying by x introduces a final S 4 , so 
the computed value of xln(l + x)/((l + 
x) - 1) is 


xln(l + x) 
((1 + x) - 1) 


(1 + ^(l + S 2 ) 


X (1 + <5 3 )(1 + S 4 ), | 8, | <e. 


the fact that ax 2 + bx + c = a(x - r x )(x 
- r 2 ), so ar 1 r 2 = c. Since b 2 ~ 4 ac, then 
r x « r 2 , so the second error term is 4 acb 4 
~ 4 a 2 r 2 5 4 . Thus, the computed value of 
s/d is \/d + 4a 2 r 2 S 4 . The inequality 

p - q < \f p 2 - q 2 < V 7 P 2 + Q 2 
— P A q 


It is easy to check that if e < 0.1, then 

(1 + a x )( 1 + a 2 )(i + a 3 )(i + « 4 ) = 1 + a, 

with | a | < 5e. B 

An interesting example of error analy¬ 
sis using formulas (19), (20), and (21) 
occurs in the quadratic formula [ — b 
± s/b 2 - 4ac]/2a. Section 1.4 explained 
how rewriting the equation will elimi¬ 
nate the potential cancellation caused by 
the ± operation. But there is another 
potential cancellation that can occur 
when computing d = b 2 - 4ac. This one 
cannot be eliminated by a simple rear¬ 
rangement of the formula. Roughly 
speaking, when b 2 = 4ac, rounding error 
can contaminate up to half the digits in 
the roots computed with the quadratic 
formula. Here is an informal proof 
(another approach to estimating the er¬ 
ror in the quadratic formula appears in 
Kahan [1972]). 

If b 2 * 4ac, rounding error can con¬ 
taminate up to half the digits in the roots 
computed with the quadratic formula [ — b 
± Vb 2 — 4ac]/2a. 

Proof. Write (6 <8> b) 0 (4a ® c) = 
(6 2 (1 + Sj) - 4ac(l + <5 2 ))(1 + 5 3 ), where 
| 8 t | < 2e. 22 Using d = b 2 - 4ac, this 
can be rewritten as (c((l + <5 4 ) — 4ac(Sj 
- <5 2 ))(1 + 5 g ). To get an estimate for the 
size of this error, ignore second-order 
terms in bi, in which the case of the 
absolute error is d(b x + b 3 ) - 4ac5 4 , 
where | 5 4 | = | 8 4 — 5 2 I — 2e. Since d < 
4 ac, the first term d(d 4 + b 3 ) can be ig¬ 
nored. To estimate the second term, use 


22 In this informal proof, assume /3 = 2 so multipli¬ 
cation by 4 is exact and does not require a 5 S . 


shows that \/d + 4 c^r 2 6^ = 's/d + E, 
where | E \ < \f4a 2 rf \ b 4 \ , so the abso¬ 
lute error in Vd /2 a is about r 1 1 /bl . 
Since 6 4 ~ £U P , sfbl « /3~ p/2 , and thus 
the absolute error of r 4 \/destroys the 
bottom half of the bits of the roots r 4 ~ 
r 2 . In other words, since the calculation 
of the roots involves computing with 
Vd /2 a and this expression does not have 
meaningful bits in the position corre¬ 
sponding to the lower order half of r t , the 
lower order bits of r t cannot be meaning¬ 
ful. B 

Finally, we turn to the proof of Theo¬ 
rem 6. It is based on the following fact in 
Theorem 14, which is proven in the 
Appendix. 

Theorem 14 

Let 0 < k < p, and set m = /3* + 1, and 
assume floating-point operations are ex¬ 
actly rounded. Then (m®x) © (m<S>x 
© x) is exactly equal to x rounded to 
p — k significant digits. More precisely, x 
is rounded by taking the significand of x, 
imagining a radix point just left of the k 
least significant digits and rounding to 
an integer. 

Proof Theorem 6. By Theorem 14, x h is 
x rounded to p - k = [ p/2 j places. If 
there is no carry out, x h can be repre¬ 
sented with [ p/2\ significant digits. 
Suppose there is a carry out. If x = 
x 0 x t x p _, x [P, rounding adds 1 to 
x p -it -i, the only way there can be a carry 
out is if x p _ k _ 1 = j8 — 1. In that case, 
however, the low-order digit of x h is 1 + 
x P -k-i ~ 0, and so again x h is repre¬ 
sentable in [ pj 2j digits. 
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To deal with x t , scale x to be an inte¬ 
ger satisfying 3 P_1 < x < 3 P - 1. Let x 
= x h + x t , where x h is the p — k high- 
order digits of x and x t is the k low-order 
digits. There are three cases to consider. 
If Xj < (/3/2)(3 k ~ 1 , then rounding x to 
p - k places is the same as chopping and 
x h = x h , and x t = x v Since x t has at 
most k digits, if p is even, then x l has at 
most k = \p/2\ = \ p /2| digits. Other¬ 
wise, (3 = 2 and x t < 2 k ~ 1 is repre¬ 
sentable with k - 1 < [ p /2| significant 
bits. The second case is when x l > 
(3/2)3* -1 ; then computing x h involves 
rounding up, so x h — x h + 3* and x t = 
x - x h = x - x h - 3* = x t ~ P k - Once 
again, x : has at most k digits, so it is 
representable with [ p /2] digits. Finally, 
if x t = (3/2)3* _1 , then x h = x h or x h + 
3 k depending on whether there is a round 
up. Therefore, x t is either (3/2)3 fe_1 or 
(3/2)3* -1 - I3 k = -3V2, both of which 
are represented with 1 digit. ■ 

Theorem 6 gives a way to express the 
product of two single-precision numbers 
exactly as a sum. There is a companion 
formula for expressing a sum exactly. If 
| x | > | y |, then x + y = (x © y) + (x 
© (x © y)) © y [Dekker 1971; Knuth 
1981, Theorem C in Section 4.2.2], When 
using exactly rounded operations, how¬ 
ever, this formula is only true for 3 = 2, 
not for 3 = 10 as the example x = .99998, 
y = .99997 shows. 

4.2 Binary-to-Decimal Conversion 

Since single precision has p = 24 and 
2 24 < 10 s , we might expect that convert¬ 
ing a binary number to eight decimal 
digits would be sufficient to recover the 
original binary number. This is not the 
case, however. 

Theorem 15 

When a binary IEEE single-precision 
number is converted to the closest eight 
digit decimal number, it is not always 
possible to recover the binary number 
uniquely from the decimal one. If nine 
decimal digits are used, however, then 


converting the decimal number to the 
closest binary number will recover the 
original floating-point number. 

Proof. Binary single-precision num¬ 
bers lying in the half-open interval 
[10 3 ,2 10 ) = [1000,1024) have 10 bits to 
the left of the binary point and 14 bits to 
the right of the binary point. Thus, there 
are (2 10 — 10 3 )2 14 = 393,216 different bi¬ 
nary numbers in that interval. If decimal 
numbers are represented with eight dig¬ 
its, there are (2 10 - 10 3 )10 4 = 240,000 
decimal numbers in the same interval. 
There is no way 240,000 decimal num¬ 
bers could represent 393,216 different bi¬ 
nary numbers. So eight decimal digits 
are not enough to represent each single¬ 
precision binary number uniquely. 

To show that nine digits are sufficient, 
it is enough to show that the spacing 
between binary numbers is always 
greater than the spacing between deci¬ 
mal numbers. This will ensure that for 
each decimal number N, the interval [N 
- 1/2 ulp, N + 1/2 ulp] contains at most 
one binary number. Thus, each binary 
number rounds to a unique decimal num¬ 
ber, which in turn rounds to a unique 
binary number. 

To show that the spacing between bi¬ 
nary numbers is always greater than the 
spacing between decimal numbers, con¬ 
sider an interval [10", 10" + 1 ]. On this 
interval, the spacing between consecu¬ 
tive decimal numbers is 10 (n + 1) ~ 9 . On 
[10",2 m ], where m is the smallest inte¬ 
ger so that 10 n <2 m , the spacing of 
binary numbers is 2 m-24 and the spac¬ 
ing gets larger further on in the inter¬ 
val. Thus, it is enough to check that 
10 (n+i)-9 < 2 m ~ 24 . But, in fact, since 
10" < 2 m , then 10 < " + 1> " 9 = 10"10^ 8 < 
2 m 10~ 8 < 2'"2~ 24 . ■ 

The same argument applied to double 
precision shows that 17 decimal digits 
are required to recover a double-precision 
number. 

Binary-decimal conversion also pro¬ 
vides another example of the use of flags. 
Recall from Section 2.1.2 that to recover 
a binary number from its decimal expan- 
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sion, the decimal-to-binary conversion 
must be computed exactly. That conver¬ 
sion is performed by multiplying the 
quantities N and 10 |p| (which are both 
exact if P < 13) in single-extended preci¬ 
sion and then rounding this to single 
precision (or dividing if P < 0; both cases 
are similar). The computation of N ■ 
10 1 cannot be exact; it is the combined 
operation round (N • 10 I ) that must be 
exact, where the rounding is from single 
extended to single precision. To see why 
it might fail to be exact, take the simple 
case of (3 = 10, p = 2 for single and p — 3 
for single extended. If the product is to 
be 12.51, this would be rounded to 12.5 
as part of the single-extended multiply 
operation. Rounding to single precision 
would give 12. But that answer is not 
correct, because rounding the product to 
single precision should give 13. The error 
is a result of double rounding. 

By using the IEEE flags, the double 
rounding can be avoided as follows. Save 
the current value of the inexact flag, then 
reset it. Set the rounding mode to round 
to zero. Then perform the multiplication 
N • 101*'. Store the new value of the 
inexact flag in ixflag, and restore the 
rounding mode and inexact flag. If ixflag 
is 0, then N- 10 |p| is exact, so round 
(N ■ lO'*') will be correct down to the 
last bit. If ixflag is 1, then some digits 
were truncated, since round to zero al¬ 
ways truncates. The significand of the 
product will look like l.b 1 b 22 b 23 
■ ■ ■ b 31 . A double-rounding error may oc¬ 
cur if b 23 • ■ • b 31 = 10 • • • 0. A simple 
way to account for both cases is to per¬ 
form a logical or of ixflag with b 31 . Then 
round (N ■ 10 |p| ) will be computed 
correctly in all cases. 

4.3 Errors in Summation 

Section 3.2.3 mentioned the problem of 
accurately computing very long sums. 
The simplest approach to improving ac¬ 
curacy is to double the precision. To get a 
rough estimate of how much doubling 
the precision improves the accuracy of a 
sum, let s x = x x , s 2 = s x ® x 2 ,. . ., s t = 
s i-i ® *r Then s t = (1 + b t )(s l _ 1 + x t ). 


where | 5, | < e, and ignoring second- 
order terms in oi gives 

7-1 \ *-7 / 

= £>,- + (31) 

7=1 7=1 \ft=7 / 

The first equality of (31) shows that 
the computed value of Y.x - is the same as 
if an exact summation was performed on 
perturbed values of x . The first term x, 
is perturbed by ne, the last term x n by 
only e. The second equality in (31) shows 
that error term is bounded by ne £ | x \. 
Doubling the precision has the effect of 
squaring e. If the sum is being done in 
an IEEE double-precision format, 1/e = 
10 16 , so that ne <1 for any reasonable 
value of n. Thus, doubling the precision 
takes the maximum perturbation of ne 
and changes it to ne 2 < e. Thus the 2e 
error bound for the Kahan summation 
formula (Theorem 8) is not as good as 
using double precision, even though it is 
much better than single precision. 

For an intuitive explanation of why 
the Kahan summation formula works, 
consider the following diagram of proce¬ 
dure: 



+ 




Y h 




= C 
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Each time a summand is added, there 
is a correction factor C that will be ap¬ 
plied on the next loop. So first subtract 
the correction C computed in the previ¬ 
ous loop from Xj, giving the corrected 
summand Y. Then add this summand to 
the running sum S'. The low-order bits of 
Y (namely, Y { ) are lost in the sum. Next, 
compute the high-order bits of Y by com¬ 
puting T - S. When Y is subtracted from 
this, the low-order bits of Y will be re¬ 
covered. These are the bits that were lost 
in the first sum in the diagram. They 
become the correction factor for the next 
loop. A formal proof of Theorem 8, taken 
from Knuth [1981] page 572, appears in 
the Appendix. 

5. SUMMARY 

It is not uncommon for computer system 
designers to neglect the parts of a system 
related to floating point. This is probably 
due to the fact that floating point is given 
little, if any, attention in the computer 
science curriculum. This in turn has 
caused the apparently widespread belief 
that floating point is not a quantifiable 
subject, so there is little point in fussing 
over the details of hardware and soft¬ 
ware that deal with it. 

This paper has demonstrated that it is 
possible to reason rigorously about float¬ 
ing point. For example, floating-point al¬ 
gorithms involving cancellation can be 
proven to have small relative errors if 
the underlying hardware has a guard 
digit and there is an efficient algorithm 
for binary-decimal conversion that can be 
proven to be invertible, provided ex¬ 
tended precision is supported. The task 
of constructing reliable floating-point 
software is made easier when the under¬ 
lying computer system is supportive of 
floating point. In addition to the two 
examples just mentioned (guard digits 
and extended precision), Section 3 of 
this paper has examples ranging from 
instruction set design to compiler opt¬ 
imization illustrating how to better 
support floating point. 

The increasing acceptance of the IEEE 
floating-point standard means that codes 


that use features of the standard are be¬ 
coming ever more portable. Section 2 
gave numerous examples illustrating 
how the features of the IEEE standard 
can be used in writing practical floating¬ 
point codes. 

APPENDIX 

This Appendix contains two technical 
proofs omitted from the text. 

Theorem 14 

Let 0 < k < p, set m = (5 k + 1, and as¬ 
sume floating-point operations are exactly 
rounded. Then (m®x) © (m <8> x © x) 
is exactly equal to x rounded to p — k 
significant digits. More precisely, x is 
rounded by taking the significand of x, 
imagining a radix point just left of the k 
least-significant digits, and rounding to 
an integer. 

Proof. The proof breaks up into two 
cases, depending on whether or not the 
computation of mx = (3 k x + x has a carry 
out or not. 

Assume there is no carry out. It is 
harmless to scale x so that it is an inte¬ 
ger. Then the computation of mx = x + 
0 k x looks like this: 

aa • ••aabb • • • bb 
aa • ••aabb ■ • • bb 
zz • - • zzbb • • ■ bb 

where x has been partitioned into two 
parts. The low-order k digits are marked 
b and the high-order p - k digits are 
marked a. To compute m 0 x from mx 
involves rounding off the low-order k 
digits (the ones marked with b) so 

m <8> x = mx — x mod(|8 fc ) + r/3 k . (32) 

The value of r is 1 if .bb • • • b is greater 
than 1/2 and 0 otherwise. More pre¬ 
cisely, 

r = 1 if a.bb • • • b rounds to a + 1, 

r = 0 otherwise. (33) 
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Next compute 

m <8) x — x = mx — x mod(/3 k ) + r/ 3 k - x 
= j3 k (x + r) - xmod(/3 A ). 

The picture below shows the computation 
of m®x — x rounded, that is, (to<S)x) 
© x. The top line is /3 k (x + r), where B 
is the digit that results from adding r to 
the lowest order digit b: 

aa • • • aabb • • • bB 00 • • • 00 

-bb • • • bb _ 

zz • • • zzZOO • • • 00 

If .bb ■ • • b < 1/2, then r - 0. Subtract¬ 
ing causes a borrow from the digit 
marked B, but the difference is rounded 
up, so the net effect is that the rounded 
difference equals the top line, which is 
f k x. If .bb • • • b > 1/2, then r = 1, and 1 
is subtracted from B because of the bor¬ 
row. So again the result is (3 k x. Finally, 
consider the case .bb • • - b = 1/2. If r = 
0, then B is even, Z is odd, and the 
difference is rounded up giving dis¬ 
similarly, when r = 1, B is odd, Z is 
even, the difference is rounded down, so 
again the difference is (3 k x. To summa¬ 
rize, 

(m <8> x) © x = (3 k x. (34) 

Combining eqs. (32) and (34) gives 
(m <8> x) - (m <8> x © x) = x — 
xmod(di) + r(3 k . The result of perform¬ 
ing this computation is 

r 00 • • • 00 
aa • • • aabb • • • bb 
| -bb - bb 
aa•• • aaOO•••00. 

The rule for computing r, eq. (33), is the 
same as the rule for rounding a - 
ab • • • b to p - k places. Thus, comput¬ 
ing mx — (mx — x) in floating-point 
arithmetic precision is exactly equal to 
rounding x to p — k places, in the case 
when x + I3 k x does not carry out. 

When x + I3 k x does carry out, mx = 
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g k x + x looks like this: 

aa • ■ • aabb • ■• bb 
aa • • • aabb • ■ • bb 
zzz • • • zZbb • • • bb 

Thus, m <8> x - mx - x modld^) + wf3 k , 
where w = -Z if Z < 13/2, but the exact 
value of w in unimportant. Next m ® x 
- x = fi k x - x mod((3 k ) + wf3 k ■ In a pic¬ 
ture 

aa • •• aabb • • • bb00 • • • 00 
- bb • •• bb 

+ w_ 


Rounding gives (m <8> x) © x = $ k x + 
wj3 k — r(3 k , where r = 1 if .bb • • • b > 
1/2 or if .bb • • • b = 1/2 and b 0 = 1. Fi¬ 
nally, 

(m®x) — (m <8> x © x) 

= mx - x mod(/3*) + w$ k 

- (fi k x + wl3 k - r(3 k ) 

= x - x mod(/3^) + r(3 k . 

Once again, r - 1 exactly when rounding 
a • • • ab • • • b to p - k places involves 
rounding up. Thus, Theorem 14 is proven 
in all cases. ■ 

Theorem 8 (Kahan Summation Formula) 

Suppose vfl i Xj is computed using the 
following algorithm 

S = X[l] 

C = 0 

for j = 2 to N { 

Y = Xlj] - C 
T = S + Y 
C = (T - S) - Y 
S = T 

} 

Then the computed sum S is equal to 
S = E*,(l + &j) + 0(IVe 2 )E \ Xj\, where 
|5,| <2e. 

Proof. First recall how the error esti¬ 
mate for the simple formula Y.x : went. 
Introduce s x = jc 1( s t = (1 + 5 l )(s l _ 1 - 1 
+ x t ). Then the computed sum is s n , 
which is a sum of terms, each of which 
is an x, multiplied by an expression 
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involving 6 J ’s. The exact coefficient of x 1 
is (1 + S 2 ) (1 + S 3 ) ■ ■ ■ (1 + 8 n ). Therefore 
by renumbering, the coefficient of x 2 
must be (1 + 5 3 )(1 + S 4 ) ■ ■ ■ (1 + 5„), and 
so on. The proof of Theorem 8 runs along 
exactly the same lines, only the coeffi¬ 
cients of x x is more complicated. In de¬ 
tail s 0 = c 0 = 0 and 

y k = x k 0 c*_ x = (x k - C*_j)(l + Vk ) 

s k = s k-i ® yk = i s k~i + tOO + °k) 
c k = (s k © s*_0 © y k 

= [( s * ~ s^-iM 1 + tO - Vk] (i + 50 

where all the Greek letters are bounded 
by e. Although the coefficient of x, in s k 
is the ultimate expression of interest, it 
turns out to be easier to compute the 
coefficient of x : in s k - c k and c k . When 
k = 1, 

Ci = (sj(l + - yj)(l + 5 X ) 

= ^((1 + ff x )( 1 + - 1)(1 + 6 X ) 

= x i( a i + 7i + °VYi) 

(1 + 60(1 + ,!) 

s i ~ c i = *i[(l + CT 0 _ ( ff i + 7i + oyvO 

(1 + 60] (1 + Vl ) 

= xjl - 7 X - - o 1 y 1 

~ S i7i ~ ff i70i](l + Vi)- 

Calling the coefficients of x x in these 
expressions C k and S k , respectively, then 

C x = 2e + 0(e 2 ) 

Sj = 1 + ,! - 7 X + 4e 2 + 0(e 3 ). 

To get the general formula for S k and 
C k , expand the definitions of s k and c k , 
ignoring all terms involving x t with 
i > 1. That gives 

s k - i s k~i + yu){ 1 + °0 

= K-i + (** - c k-i){l + i?*)] 

(! + a k ) 

= [( S A-1 - C *-l) ~ Vk C k-l] (l + a k) 


c k - [{ S A “ s A-i}(l + 7k) - 7k] (1 + &k) 
= [{(( s *-i - c k-i) - %<T_0(1 + o k ) 

_s A-i}(l + 7k) + c a-i(! + Vk)\ 

(1 + 5 0 

= [{(s*-l - C k -l)°k - Vk^k- l(l + Ok) 
_c *-i}(l + 7k) + <T-i(l + Vk)] 

(1 + 50 

= [(-1 “ c k-i) a k {1 + 7k) 

~ c k-i(7k + V k (°k + 7k + <7*7*))] 

(i + 5 0 

s k ~ c k 

- (( s a-i ~ c k-i) ~ Vk c k-i) 

(1 + a k) 

- [( S A-1 - c k-i)o k {l + 7k) 
~ c k-i(7k + Vk(°k + 7k + a k7k))] 

(1 + 50 

= ( s £-i ~ c fe- i)(( i + ct 0 

-^(1 + yOO + 5 *)) 

+ Ck-ii-Vki 1 + a k) 

+ (y* + Vk(°k + 7k + °k7k)) 

0 + s k)) 

= ( S k-1 - c k _ x ) 

(i - °k(y k + 5 fe + 7*50) 

+ c k-i[~Vk + 7k 

+v k {y k + a k7k) 

+ (y* + Vk(°k + 7k + o k y k ))b k \ 

Since S k and C k are only being computed 
up to order e 2 , these formulas can be 
simplified to 

C A = (a* + 0(e 2 ))S,„! 

+ {~7k + 0(e 2 ))C k _ 1 
«*=((!+2 e 2 + 0(e 3 ))S,_ 1 
+ (2e + 0(e 2 ))C k _ 1 . 
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Using these formulas gives 
C 2 = a 2 + 0(e 2 ) 

&2 = 1 + Vi ~ 7i + 10e 2 + 0(e 3 ), 

and, in general, it is easy to check by 
induction that 

C k = o k + 0(e 2 ) 

S k =l + r tl -y 1 + (4k + 2)e 2 + 0(e 3 ). 

Finally, what is wanted is the coeffi¬ 
cient of in s k . To get this value, let 
x n+l = 0, let all the Greek letters with 
subscripts of n + 1 equal 0 and compute 
s ra+1 . Then s ra+1 = s n - c n , and the coef¬ 
ficient of x x in s n is less than the coeffi¬ 
cient in s n+1 , which is S n = 1 + ri 1 — y 1 
+ (4 n + 2)e 2 = 1 + 2e + 0(ne 2 ). M 
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